We are working on CategoricalEncoder in https://github.com/scikit-learn/scikit-learn/pull/9151 to make encoding categorical columns like this easier for users. Feedback and testing are welcome.
On 6 August 2017 at 02:13, Sebastian Raschka <se.rasc...@gmail.com> wrote:

> Hi, Georg,
>
> I bring this up every time here on the mailing list :), and you are
> probably aware of this issue, but it makes a difference whether your
> categorical data is nominal or ordinal. For instance, if you have an
> ordinal variable with values like {small, medium, large}, you probably
> want to encode it as {1, 2, 3} or {1, 20, 100} or whatever is
> appropriate based on your domain knowledge regarding the variable. If
> you have something like {blue, red, green}, it may make more sense to
> do a one-hot encoding so that the classifier doesn't assume a
> relationship between the values like blue > red > green or something
> like that.
>
> Now, the DictVectorizer and OneHotEncoder both do one-hot encoding.
> The LabelEncoder does convert a variable to integer values, but if you
> have something like {small, medium, large}, it wouldn't know the order
> (if that's an ordinal variable) and it would just assign integers
> following the sorted (alphabetical) order of the labels. Thus, if you
> are dealing with ordinal variables, there's no way around doing this
> manually; for example, you could create mapping dictionaries for that
> (most conveniently done in pandas).
>
> Best,
> Sebastian
>
> > On Aug 5, 2017, at 5:10 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
> >
> > Hi,
> >
> > the LabelEncoder is only meant for a single column, i.e. the target
> > variable. Is the DictVectorizer or a manual chaining of multiple
> > LabelEncoders (one per categorical column) the desired way to get
> > values which can be fed into a subsequent classifier?
> >
> > Is there some way I have overlooked which works better and possibly
> > also can handle unseen values by applying most frequent imputation?
> >
> > regards,
> > Georg
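
For reference, the manual approach discussed in the quoted messages can be sketched roughly as follows. This is only a minimal illustration in plain Python, not what CategoricalEncoder does: the toy rows, the column names, the SIZE_ORDER mapping and the encode() helper are all made up for the example. It encodes the ordinal column with a hand-written mapping, one-hot encodes the nominal column, and replaces values unseen during training with the most frequent training value.

```python
from collections import Counter

# Hypothetical toy data; column names and values are made up for illustration.
train = [
    {"size": "small", "color": "blue"},
    {"size": "large", "color": "red"},
    {"size": "medium", "color": "green"},
    {"size": "small", "color": "blue"},
]
test = [
    {"size": "medium", "color": "red"},
    {"size": "huge", "color": "purple"},  # both values unseen in training
]

# Ordinal variable: explicit mapping chosen from domain knowledge.
SIZE_ORDER = {"small": 1, "medium": 2, "large": 3}

# Everything below is learned from the training rows only.
color_categories = sorted({row["color"] for row in train})
most_frequent = {
    col: Counter(row[col] for row in train).most_common(1)[0][0]
    for col in ("size", "color")
}


def encode(rows):
    """Return one feature list per row: [size_as_int, one-hot color...]."""
    encoded = []
    for row in rows:
        # Fall back to the most frequent training value for unseen categories.
        size = row["size"] if row["size"] in SIZE_ORDER else most_frequent["size"]
        color = (
            row["color"]
            if row["color"] in color_categories
            else most_frequent["color"]
        )
        # One 0/1 indicator per known color, in a fixed column order.
        one_hot = [int(color == c) for c in color_categories]
        encoded.append([SIZE_ORDER[size]] + one_hot)
    return encoded


X_train = encode(train)
X_test = encode(test)
```

The resulting lists of numbers can be passed to a classifier. With pandas, the same dictionary can be applied to a column with Series.map.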
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn