@Sebastian: thanks. Indeed, I am aware of this problem. I developed something here: https://gist.github.com/geoHeil/5caff5236b4850d673b2c9b0799dc2ce but found that prediction is quite slow when around 100-150 columns are used as input. Do you have any ideas on how to speed this up?
Regards,
Georg

Joel Nothman <joel.noth...@gmail.com> wrote on Sun, 6 Aug 2017 at 00:49:

> We are working on CategoricalEncoder in
> https://github.com/scikit-learn/scikit-learn/pull/9151 to help users more
> with this kind of thing. Feedback and testing is welcome.
>
> On 6 August 2017 at 02:13, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>
>> Hi, Georg,
>>
>> I bring this up every time here on the mailing list :), and you are
>> probably aware of this issue, but it makes a difference whether your
>> categorical data is nominal or ordinal. For instance, if you have an
>> ordinal variable with values like {small, medium, large}, you probably
>> want to encode it as {1, 2, 3} or {1, 20, 100} or whatever is appropriate
>> based on your domain knowledge regarding the variable. If you have
>> something like {blue, red, green}, it may make more sense to do a one-hot
>> encoding so that the classifier doesn't assume a relationship between the
>> values like blue > red > green or something like that.
>>
>> Now, the DictVectorizer and OneHotEncoder are both doing one-hot
>> encoding. The LabelEncoder does convert a variable to integer values, but
>> if you have something like {small, medium, large}, it wouldn't know the
>> order (if that's an ordinal variable) and it would just assign integers
>> following the sorted (alphabetical) order of the labels. Thus, if you are
>> dealing with ordinal variables, there's no way around doing this manually;
>> for example, you could create mapping dictionaries for that (most
>> conveniently done in pandas).
>>
>> Best,
>> Sebastian
>>
>> > On Aug 5, 2017, at 5:10 AM, Georg Heiler <georg.kf.hei...@gmail.com> wrote:
>> >
>> > Hi,
>> >
>> > the LabelEncoder is only meant for a single column, i.e. the target
>> > variable. Is the DictVectorizer or a manual chaining of multiple
>> > LabelEncoders (one per categorical column) the desired way to get values
>> > which can be fed into a subsequent classifier?
>> >
>> > Is there some way I have overlooked which works better and possibly
>> > also can handle unseen values by applying most frequent imputation?
>> >
>> > regards,
>> > Georg
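
For readers of the archive, the distinction discussed in the quoted thread can be sketched as follows. This is only a minimal illustration, not the approach from the gist or from the CategoricalEncoder pull request: the column names, the toy data, the size_order mapping and the encode() helper are all assumptions made up for this example. It encodes an ordinal column with a mapping dictionary, one-hot encodes a nominal column with pandas, and replaces categories unseen during training with the most frequent training category.

```python
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
train = pd.DataFrame({
    "size": ["small", "large", "medium", "small"],
    "color": ["blue", "red", "green", "blue"],
})
test = pd.DataFrame({
    "size": ["medium", "large"],
    "color": ["red", "purple"],  # "purple" never appears in train
})

# Ordinal variable: an explicit mapping dictionary carries the domain order.
size_order = {"small": 1, "medium": 2, "large": 3}

# Nominal variable: remember the training categories and the most frequent one.
color_categories = sorted(train["color"].unique())
color_mode = train["color"].mode()[0]


def encode(df):
    """Encode `size` as ordered integers and `color` as one-hot columns."""
    out = pd.DataFrame(index=df.index)
    out["size"] = df["size"].map(size_order)
    # Replace values not seen during training with the most frequent category.
    color = df["color"].where(df["color"].isin(color_categories), color_mode)
    # A fixed category list keeps the one-hot columns identical across frames.
    color = pd.Series(
        pd.Categorical(color, categories=color_categories), index=df.index
    )
    dummies = pd.get_dummies(color, prefix="color").astype(int)
    return pd.concat([out, dummies], axis=1)


train_encoded = encode(train)
test_encoded = encode(test)
print(test_encoded)
```

All operations here are column-wise pandas calls rather than per-row Python loops, which may help when there are many input columns, though that has not been measured here.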
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn