There is https://github.com/scikit-learn/scikit-learn/pull/4899.
It looks like it is waiting for review?

Raphael

On 29 March 2017 at 11:50, federico vaggi <vaggi.feder...@gmail.com> wrote:
> That's a really good point. Do you know of any systematic studies of the
> two different encodings?
>
> Finally: wasn't there a PR for RF to accept categorical variables as
> inputs?
>
> On Wed, 29 Mar 2017 at 11:57, Olivier Grisel <olivier.gri...@ensta.org>
> wrote:
>>
>> Integer coding will indeed make the DT assume an arbitrary ordering,
>> while one-hot encoding does not force the tree model to make that
>> assumption.
>>
>> However, in practice, when the depth of the trees is not too limited
>> (or if you use a large enough ensemble of trees), the model has enough
>> flexibility to introduce as many splits as necessary to isolate
>> individual categories in the integer coding, so the arbitrary ordering
>> assumption is not a problem.
>>
>> On the other hand, one-hot encoding can introduce a detrimental
>> inductive bias in random forests: a random forest uses uniform random
>> feature sampling when deciding which feature to split on (e.g. pick
>> the best split out of 25% of the features selected at random).
>>
>> Consider the following example: assume you have a heterogeneously
>> typed dataset with 99 numeric features and 1 categorical feature with
>> cardinality 1000 (1000 possible values for that feature):
>>
>> - with integer coding, the RF has a 1-in-100 chance of picking each
>> feature (categorical or numerical) as a candidate for the next split;
>> - with one-hot encoding, the RF has roughly a 0.1% chance (1/1099) of
>> picking each numerical feature and roughly a 91% chance (1000/1099) of
>> selecting a candidate split on a category of the single categorical
>> feature.
>>
>> The inductive bias of one-hot encoding on RFs can therefore completely
>> break the feature balancing. The feature encoding also affects the
>> inductive bias with respect to the importance of the depth of the
>> trees, even when feature splits are selected fully deterministically.
>>
>> Finally, one-hot encoding features with large categorical cardinalities
>> is much slower than naive integer coding.
>>
>> TL;DR: a naive theoretical analysis based only on the ordering
>> assumption can be misleading. The inductive biases of each feature
>> encoding are more complex to evaluate. Use cross-validation to decide
>> which works best on your problem, and don't ignore computational
>> considerations (CPU and memory usage).
>>
>> --
>> Olivier

_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn
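
[Editor's note: to make the feature-sampling arithmetic in Olivier's example concrete, here is a minimal Python sketch. The feature counts (99 numeric features, one categorical feature with 1000 levels) come from the thread; the variable names are illustrative.]

    # Feature-sampling probabilities under the two encodings, using the
    # dataset shape from the example above.
    n_numeric = 99
    n_categories = 1000

    # Integer coding: the categorical feature stays a single column.
    n_features_integer = n_numeric + 1
    print(f"integer coding: P(pick any given feature) = "
          f"{1 / n_features_integer:.2%}")  # 1.00%

    # One-hot encoding: the categorical feature explodes into 1000 columns.
    n_features_onehot = n_numeric + n_categories
    print(f"one-hot: P(pick a given numeric feature)   = "
          f"{1 / n_features_onehot:.2%}")  # 0.09%
    print(f"one-hot: P(candidate split is categorical) = "
          f"{n_categories / n_features_onehot:.2%}")  # 91.0%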
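
[Editor's note: a minimal sketch of the cross-validation comparison Olivier recommends, assuming a modern scikit-learn (>= 1.0, so OrdinalEncoder and ColumnTransformer are available); the synthetic data, column names, and hyperparameters are all illustrative, not from the thread.]

    import numpy as np
    import pandas as pd
    from sklearn.compose import ColumnTransformer
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

    # Illustrative dataset: one numeric column and one categorical column
    # with 50 levels, with the target depending on the category.
    rng = np.random.RandomState(0)
    n = 1000
    X = pd.DataFrame({
        "num": rng.randn(n),
        "cat": rng.choice([f"c{i}" for i in range(50)], size=n),
    })
    y = (X["cat"].str.lstrip("c").astype(int) % 2 == 0).astype(int)

    # Compare the two encodings with the same model via cross-validation.
    for name, encoder in [
        ("integer", OrdinalEncoder()),
        ("one-hot", OneHotEncoder(handle_unknown="ignore")),
    ]:
        pre = ColumnTransformer([("cat", encoder, ["cat"])],
                                remainder="passthrough")
        model = make_pipeline(
            pre, RandomForestClassifier(n_estimators=100, random_state=0))
        scores = cross_val_score(model, X, y, cv=5)
        print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")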