A follow-up to my previous question: is fitting a LabelEncoder on the *entire* dataset (instead of only on the training set) as much of a "sin" (i.e. a common ML mistake) as doing so with a Scaler or some other preprocessing step?
If the answer is yes (which I assume it is, since it can be considered a form of data leakage), what is the standard way to handle test values (for a categorical variable) that were never encountered in the training set?

On 9 January 2014 15:21, Christian Jauvin <cjau...@gmail.com> wrote:
> Hi,
>
> If a LabelEncoder has been fitted on a training set, it might break if it
> encounters new values when used on a test set.
>
> The only solution I could come up with is to map everything new in the
> test set (i.e. not belonging to any existing class) to "<unknown>", and
> then explicitly add a corresponding class to the LabelEncoder afterward:
>
> # train and test are pandas.DataFrame's and c is whatever column
> le = LabelEncoder()
> train[c] = le.fit_transform(train[c])
> test[c] = test[c].map(lambda s: '<unknown>' if s not in le.classes_ else s)
> le.classes_ = np.append(le.classes_, '<unknown>')
> test[c] = le.transform(test[c])
>
> This works, but is there a better solution?

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
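For reference, here is the quoted workaround as a self-contained, runnable sketch. The column name "c" and the toy data are made up for illustration, and mutating `le.classes_` after fitting relies on LabelEncoder internals rather than any documented API, so treat this as a sketch of the pattern, not a supported recipe:

```python
# Sketch of the "<unknown>" sentinel trick from the quoted email.
# The column "c" and the data below are illustrative, not from the thread.
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

train = pd.DataFrame({"c": ["a", "b", "b", "a"]})
test = pd.DataFrame({"c": ["a", "z", "b"]})  # "z" never appears in train

le = LabelEncoder()
train["c"] = le.fit_transform(train["c"])   # classes_ is now ["a", "b"]

# Replace any value unseen during fitting with the sentinel "<unknown>",
# then register the sentinel as an extra class on the fitted encoder.
test["c"] = test["c"].map(lambda s: s if s in le.classes_ else "<unknown>")
le.classes_ = np.append(le.classes_, "<unknown>")
test["c"] = le.transform(test["c"])
```

Note that the sentinel is appended *after* the original classes, so the codes already assigned to the training data stay unchanged and "<unknown>" simply gets the next integer.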