On 01/11/2014 06:49 PM, Christian Jauvin wrote:
> Another take on my previous question is this other question:
>
> Is fitting a LabelEncoder on the *entire* dataset (instead of only on
> the training set) an equivalent "sin" (i.e. a common ML mistake) as,
> say, doing so with a Scaler or some other preprocessing technique?
>
> If the answer is yes (which is what I assume, because it can be
> considered, I guess, a form of data leakage), what is the standard
> way to handle test values (for a categorical variable) that have
> never been encountered in the training set?

Sorry for the late reply. I actually had the same problem recently.
Fitting on the whole dataset is not a sin here. The reason you can't transform new values is simply that there is no value for them to map to.

Now, with respect to sinning: there is really no additional information in the labels that could be used during learning. The only case where it could matter is if the labels have some meaningful ordering, so that a label's position relative to the others carries information — but that would be a strange thing to encode here anyway.

I am not entirely sure why we currently have this restriction (Lars, do you know by any chance?). I think we should have the option of adding new labels by "counting on", i.e. assigning each previously unseen label the next free integer. I don't see the harm in that.

Cheers,
Andy

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
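[For what it's worth, the "counting on" idea could be sketched as below. This is not part of scikit-learn — `CountingOnLabelEncoder` is a hypothetical name, and the class deliberately avoids touching `LabelEncoder` internals; it just keeps a dict of codes and hands a fresh integer to any label first seen at transform time.]

```python
class CountingOnLabelEncoder:
    """Minimal sketch (not a scikit-learn class): encode labels as
    integers, and "count on" past the training labels when transform
    meets a label that fit never saw."""

    def fit(self, y):
        # Assign codes 0..n-1 to the distinct training labels.
        self.mapping_ = {label: i for i, label in enumerate(sorted(set(y)))}
        return self

    def transform(self, y):
        # Unseen labels get the next free integer instead of an error.
        for label in y:
            if label not in self.mapping_:
                self.mapping_[label] = len(self.mapping_)
        return [self.mapping_[label] for label in y]


enc = CountingOnLabelEncoder().fit(["paris", "tokyo", "paris"])
print(enc.transform(["tokyo", "amsterdam"]))  # "amsterdam" gets a new code
```

(A plain `sklearn.preprocessing.LabelEncoder` would raise a ValueError on "amsterdam" here, which is exactly the restriction discussed above.)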