On 01/11/2014 06:49 PM, Christian Jauvin wrote:
> Another take on my previous question is this other question:
>
> Is fitting a LabelEncoder on the *entire* dataset (instead of only on
> the training set) an equivalent "sin" (i.e. a common ML mistake) as
> say doing so with a Scaler or some other preprocessing technique?
>
> If the answer is yes (which is what I assume because it can be
> considered I guess as a form of data leakage), what is the standard
> way to solve the issue of test values (for a categorical variable)
> that have never been encountered in the training set?
>
Sorry for the late reply. I actually had the same problem recently.

Fitting on the whole dataset is not a sin here. The reason you can't 
transform new values is simply that there is no integer code for them 
to map to.

Now, with respect to sinning: there is really no additional information 
in the labels themselves that could leak into learning. The only case 
where it could matter is if the labels carry some meaningful ordering 
and the position of a label relative to the others is important.
But that would be a somewhat strange thing to encode this way anyhow.

I am not entirely sure why we currently have this restriction (Lars, do 
you know by any chance?)
I think we should have the option of adding new labels by "counting on". 
I don't see the harm in that.
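To make the "counting on" idea concrete, here is a minimal sketch of what 
such a scheme could look like. This is not scikit-learn's API (LabelEncoder 
currently raises on unseen values); it just rebuilds the fitted mapping as a 
dict and assigns the next free integer to each label first seen at test time. 
The `train`/`test` data are made up for illustration.

```python
from sklearn.preprocessing import LabelEncoder

train = ["paris", "tokyo", "paris"]
test = ["tokyo", "amsterdam"]  # "amsterdam" never seen during training

# Fit on the training labels only, as usual.
le = LabelEncoder().fit(train)

# Hypothetical "counting on" scheme: keep the codes from the training
# fit, and give each previously unseen test label the next free integer
# instead of raising a ValueError.
mapping = {label: code for code, label in enumerate(le.classes_)}
for label in test:
    if label not in mapping:
        mapping[label] = len(mapping)

test_codes = [mapping[label] for label in test]
# "tokyo" keeps its training code; "amsterdam" gets a new one.
print(test_codes)  # [1, 2]
```

Whether downstream estimators can do anything sensible with a code they 
never saw during training is a separate question, of course.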

Cheers,
Andy

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general