Hi, I have noticed a change in the behaviour of LabelBinarizer between version 0.15 and earlier versions.
Prior to 0.15, this worked:

    >>> from sklearn.preprocessing import LabelBinarizer
    >>> lb = LabelBinarizer()
    >>> lb.fit_transform(['a', 'b', 'c'])
    array([[1, 0, 0],
           [0, 1, 0],
           [0, 0, 1]])
    >>> lb.transform(['a', 'd', 'e'])
    array([[1, 0, 0],
           [0, 0, 0],
           [0, 0, 0]])

Note that the values 'd' and 'e', which were never "seen" while the LabelBinarizer was being fitted, were simply mapped to [0, 0, 0] by "transform". I interpreted and used this as an "unknown" class. That is useful when your test data can contain values that could not be known in advance, i.e. at training time, and it also helps to avoid "data leakage" during cross-validation.

With 0.15, the same code now gives the error:

    [...]
    ValueError: classes ['a' 'b' 'c'] missmatch with the labels ['a' 'd' 'e']found in the data

I wrote about this a couple of months ago, with regard to a similar issue with the LabelEncoder:

http://sourceforge.net/p/scikit-learn/mailman/message/31827616/

So if my understanding of this mechanism is correct (as well as my assumptions about the way it is or should be used), would it make sense to add something like a "map_unknowns_to_single_class" extra parameter to all the preprocessing encoders, so that this behaviour can at least be enabled optionally?

Thanks,

Christian

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
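To make the proposal above concrete, here is a minimal sketch of the behaviour being asked for. It is a toy, pure-Python class, not scikit-learn code: the class name SimpleLabelBinarizer is made up for illustration, and "map_unknowns_to_single_class" is the hypothetical parameter suggested in the message, not an existing scikit-learn option. It returns plain lists rather than NumPy arrays.

```python
class SimpleLabelBinarizer:
    """Toy one-vs-all binarizer; NOT scikit-learn's LabelBinarizer."""

    def __init__(self, map_unknowns_to_single_class=False):
        # Hypothetical flag, named after the parameter proposed above.
        self.map_unknowns_to_single_class = map_unknowns_to_single_class

    def fit(self, labels):
        # The sorted unique labels define the column order.
        self.classes_ = sorted(set(labels))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        rows = []
        for label in labels:
            row = [0] * len(self.classes_)
            if label in self._index:
                row[self._index[label]] = 1
            elif not self.map_unknowns_to_single_class:
                # Strict mode: refuse labels that were not seen in fit().
                raise ValueError("unseen label: %r" % (label,))
            # Otherwise the row stays all zeros, i.e. the "unknown" class.
            rows.append(row)
        return rows


lb = SimpleLabelBinarizer(map_unknowns_to_single_class=True).fit(['a', 'b', 'c'])
encoded = lb.transform(['a', 'd', 'e'])
print(encoded)  # [[1, 0, 0], [0, 0, 0], [0, 0, 0]]
```

With the flag left at its default of False, transform raises a ValueError for 'd' and 'e', which mirrors the strict behaviour seen in 0.15. With the flag set to True, unseen labels fall back to the all-zero row, as in the pre-0.15 output.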