To put my previous question another way:

Is fitting a LabelEncoder on the *entire* dataset (instead of only on
the training set) the same kind of "sin" (i.e. a common ML mistake) as
doing so with a Scaler or some other preprocessing step?

If the answer is yes (which is what I assume, since it can arguably be
considered a form of data leakage), what is the standard way to handle
test values (for a categorical variable) that were never encountered
in the training set?
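To make the question concrete, the usual leakage-safe recipe I have in mind is: fit the encoder on the training column only, and reserve a sentinel label for categories that only appear at test time. A minimal pure-Python sketch of that idea (the `fit`/`transform` names just mirror the scikit-learn API; this is not the library implementation):

```python
class TrainOnlyLabelEncoder:
    """Toy label encoder: fit on training data only; unseen test
    categories all map to a reserved '<unknown>' code."""

    UNKNOWN = "<unknown>"

    def fit(self, values):
        # Classes are learned from the training set alone, plus the
        # reserved unknown token, so test data never influences the fit.
        self.classes_ = sorted(set(values)) + [self.UNKNOWN]
        self._index = {c: i for i, c in enumerate(self.classes_)}
        return self

    def transform(self, values):
        unknown_code = self._index[self.UNKNOWN]
        return [self._index.get(v, unknown_code) for v in values]

enc = TrainOnlyLabelEncoder().fit(["red", "green", "red"])
print(enc.transform(["green", "blue"]))  # "blue" unseen in training -> [0, 2]
```

The point is only that the fitted state (`classes_`) is a function of the training data alone, which is what the leakage concern is about.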


On 9 January 2014 15:21, Christian Jauvin <cjau...@gmail.com> wrote:
> Hi,
>
> If a LabelEncoder has been fitted on a training set, it might break if it
> encounters new values when used on a test set.
>
> The only solution I could come up with for this is to map everything new in
> the test set (i.e. not belonging to any existing class) to "<unknown>", and
> then explicitly add a corresponding class to the LabelEncoder afterward:
>
> # train and test are pandas DataFrames and c is a categorical column
> import numpy as np
> from sklearn.preprocessing import LabelEncoder
>
> le = LabelEncoder()
> train[c] = le.fit_transform(train[c])
> # Map any test value unseen during training to the sentinel '<unknown>'
> test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
> le.classes_ = np.append(le.classes_, '<unknown>')
> test[c] = le.transform(test[c])
>
> This works, but is there a better solution?
>
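One caveat with the quoted snippet, assuming `LabelEncoder.transform` is implemented with `np.searchsorted` (as it appears to be in current scikit-learn): binary search requires `classes_` to stay sorted, but `np.append` puts `'<unknown>'` at the end even though `'<'` sorts before ASCII letters. Re-sorting the classes after appending would keep lookups correct. The stdlib `bisect` module shows the same issue:

```python
import bisect

classes = ["apple", "banana", "cherry"]

# Appending at the end breaks the sorted invariant: '<' sorts before
# any ASCII letter, so the array is no longer in order.
appended = classes + ["<unknown>"]
print(appended == sorted(appended))  # -> False

# Re-sorting restores correct binary-search lookups for every class.
kept_sorted = sorted(classes + ["<unknown>"])
assert all(
    kept_sorted[bisect.bisect_left(kept_sorted, c)] == c
    for c in kept_sorted
)
print(kept_sorted)  # -> ['<unknown>', 'apple', 'banana', 'cherry']
```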

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general