Thanks very much for the thorough answer. I hadn't thought about the inductive bias issue with my forests. I'll evaluate both sets of coding for my unordered categoricals.
Andrew

<~~~~~~~~~~~~~~~~~~~~~~~~~~~>
J. Andrew Howe, PhD
www.andrewhowe.com
http://www.linkedin.com/in/ahowe42
https://www.researchgate.net/profile/John_Howe12/
I live to learn, so I can learn to live. - me
<~~~~~~~~~~~~~~~~~~~~~~~~~~~>

On Wed, Mar 29, 2017 at 12:56 PM, Olivier Grisel <olivier.gri...@ensta.org> wrote:
> Integer coding will indeed make the DT assume an arbitrary ordering,
> while one-hot encoding does not force the tree model to make that
> assumption.
>
> However, in practice, when the depth of the trees is not too limited (or
> if you use a large enough ensemble of trees), the model will have
> enough flexibility to introduce as many splits as necessary to isolate
> individual categories in the integer coding, so the arbitrary
> ordering assumption is not a problem.
>
> On the other hand, using one-hot encoding can introduce a detrimental
> inductive bias in random forests: a random forest uses uniform random
> feature sampling when deciding which feature to split on (e.g. it picks
> the best split out of 25% of the features selected at random).
>
> Consider the following example: assume you have a heterogeneously
> typed dataset with 99 numeric features and 1 categorical feature of
> cardinality 1000 (1000 possible values for that feature):
>
> - with integer coding, the RF has a 1-in-100 chance of picking each
> feature (categorical or numerical) as a candidate for the next split;
> - with one-hot encoding, the RF has roughly a 0.09% chance of picking
> each numerical feature and roughly a 91% chance (1000 of the 1099
> columns) of selecting a candidate split on a category of the single
> categorical feature.
>
> The inductive bias of one-hot encoding on RFs can therefore completely
> break the feature balancing. The feature encoding will also affect the
> inductive bias with respect to the importance of the depth of the
> trees, even when feature splits are selected fully deterministically.
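The column-counting argument in the quoted example can be checked with a few lines of arithmetic (the 99 + 1000 = 1099 column count is the one-hot design matrix described above; the exact percentages follow from it):

```python
# Feature-sampling probabilities for the example above:
# 99 numeric features + 1 categorical feature with 1000 levels.

n_numeric = 99
n_levels = 1000

# Integer coding: 100 columns total, each equally likely to be sampled.
p_int = 1 / (n_numeric + 1)

# One-hot coding: 99 numeric columns + 1000 binary indicator columns.
n_onehot = n_numeric + n_levels
p_num = 1 / n_onehot          # chance any single numeric column is picked
p_cat = n_levels / n_onehot   # chance the pick lands on the categorical feature

print(f"integer coding:  each feature ~{p_int:.1%}")
print(f"one-hot coding:  each numeric column ~{p_num:.2%}, "
      f"categorical feature overall ~{p_cat:.1%}")
```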
> Finally, one-hot encoding features with large categorical cardinalities
> will be much slower than naive integer coding.
>
> TL;DR: a naive theoretical analysis based only on the ordering
> assumption can be misleading. The inductive biases of each feature
> encoding are more complex to evaluate. Use cross-validation to decide
> which is best for your problem. Don't ignore computational
> considerations (CPU and memory usage).
>
> --
> Olivier
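A minimal sketch of the cross-validation comparison suggested above, using the current scikit-learn API. The synthetic dataset, cardinality, and hyperparameters here are illustrative assumptions, not anything from the thread:

```python
# Compare integer (ordinal) vs one-hot encoding of the same categorical
# column inside a random forest pipeline, scored by cross-validation.
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.RandomState(0)
X_num = rng.randn(500, 5)                      # 5 numeric features
X_cat = rng.randint(0, 30, size=(500, 1))      # 1 higher-cardinality categorical
X = np.hstack([X_num, X_cat])
y = (X_num[:, 0] + (X_cat[:, 0] % 2) > 0.5).astype(int)

results = {}
for name, enc in [("ordinal", OrdinalEncoder()),
                  ("one-hot", OneHotEncoder(handle_unknown="ignore"))]:
    # Encode only column 5 (the categorical); pass numeric columns through.
    pre = ColumnTransformer([("cat", enc, [5])], remainder="passthrough")
    model = make_pipeline(
        pre, RandomForestClassifier(n_estimators=100, random_state=0))
    scores = cross_val_score(model, X, y, cv=5)
    results[name] = scores.mean()
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Whichever encoding scores better on your own data is the one to keep, per the cross-validation advice above.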
_______________________________________________
scikit-learn mailing list
scikit-learn@python.org
https://mail.python.org/mailman/listinfo/scikit-learn