Hi,

I have noticed a change in the behaviour of LabelBinarizer between
version 0.15 and earlier versions.

Prior to 0.15, this worked:

>>> from sklearn.preprocessing import LabelBinarizer
>>> lb = LabelBinarizer()
>>> lb.fit_transform(['a', 'b', 'c'])
array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]])
>>> lb.transform(['a', 'd', 'e'])
array([[1, 0, 0],
       [0, 0, 0],
       [0, 0, 0]])

Note that the values 'd' and 'e', which were never "seen" while the
LabelBinarizer was being fitted, were simply mapped to [0, 0, 0] by
"transform". I interpreted and used this all-zero row as an "unknown"
class. It is useful when the test data can contain values that could
not be known in advance, i.e. at training time, and it also helps to
avoid "data leakage" during cross-validation.

With 0.15, the same code now gives the error:

[...]
ValueError: classes ['a' 'b' 'c'] missmatch with the labels ['a' 'd'
'e']found in the data

I wrote about this a couple of months ago, with regard to a similar
issue with the LabelEncoder:

http://sourceforge.net/p/scikit-learn/mailman/message/31827616/

So if my understanding of this mechanism is correct (as well as my
assumptions about the way it is/should be used), would it make sense
to add something like a "map_unknowns_to_single_class" extra parameter
to all the preprocessing encoders, so that this behaviour can be at
least implemented optionally?
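
To make the proposal concrete, here is a rough sketch of the behaviour
I have in mind, written in plain Python so that it does not depend on
any particular scikit-learn version. The helper names (fit_classes,
binarize) are hypothetical, and the keyword only borrows the parameter
name suggested above; none of this is scikit-learn's actual API or
implementation.

```python
def fit_classes(labels):
    """Return the sorted list of distinct labels seen during training."""
    return sorted(set(labels))


def binarize(classes, values, map_unknowns_to_single_class=True):
    """One-hot encode `values` against the fitted `classes`.

    Hypothetical helper, not part of scikit-learn. When
    map_unknowns_to_single_class is True, a value that is not in
    `classes` becomes an all-zero row (the pre-0.15 behaviour described
    above). When it is False, a ValueError is raised instead.
    """
    index = {label: i for i, label in enumerate(classes)}
    rows = []
    for value in values:
        row = [0] * len(classes)
        if value in index:
            row[index[value]] = 1
        elif not map_unknowns_to_single_class:
            raise ValueError("unseen label: %r" % (value,))
        rows.append(row)
    return rows


classes = fit_classes(['a', 'b', 'c'])
encoded = binarize(classes, ['a', 'd', 'e'])
print(encoded)  # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```

Making the lenient behaviour an explicit option would let users who
rely on the all-zero "unknown" row keep it, while everybody else still
gets an error when unexpected labels show up.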

Thanks,

Christian

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general