IMO CART can handle categorical features just as well as conditional inference trees (CITrees), as long as we slightly change sklearn's implementation...
-- Julio

> On 29 Mar 2017, at 15:30, Andreas Mueller <t3k...@gmail.com> wrote:
>
> I'd argue that's why we should implement conditional inference trees ;)
>
>> On 03/29/2017 05:56 AM, Olivier Grisel wrote:
>>
>> Integer coding will indeed make the DT assume an arbitrary ordering,
>> while one-hot encoding does not force the tree model to make that
>> assumption.
>>
>> However, in practice, when the depth of the trees is not too limited
>> (or if you use a large enough ensemble of trees), the model will have
>> enough flexibility to introduce as many splits as necessary to isolate
>> individual categories of the integer-coded feature, so the arbitrary
>> ordering assumption is not a problem.
>>
>> On the other hand, one-hot encoding can introduce a detrimental
>> inductive bias in random forests: a random forest uses uniform random
>> feature sampling when deciding which feature to split on (e.g. pick
>> the best split out of 25% of the features selected at random).
>>
>> Consider the following example: assume you have a heterogeneously
>> typed dataset with 99 numeric features and 1 categorical feature of
>> cardinality 1000 (1000 possible values for that feature):
>>
>> - with integer coding, the RF has a 1-in-100 chance of picking each
>>   feature (categorical or numerical) as a candidate for the next
>>   split;
>> - with one-hot encoding, the RF has roughly a 0.09% chance (1/1099)
>>   of picking each numerical feature, and roughly a 91% chance
>>   (1000/1099) of selecting a candidate split on a category of the
>>   single categorical feature.
>>
>> The inductive bias of one-hot encoding on RFs can therefore
>> completely break the feature balancing. The feature encoding will
>> also affect the inductive bias with respect to the importance of the
>> depth of the trees, even when feature splits are selected fully
>> deterministically.
>>
>> Finally, one-hot encoding features with large categorical
>> cardinalities will be much slower than naive integer coding.
>>
>> TL;DR: a naive theoretical analysis based only on the ordering
>> assumption can be misleading. The inductive biases of each feature
>> encoding are more complex to evaluate. Use cross-validation to decide
>> which is best for your problem, and don't ignore computational
>> considerations (CPU and memory usage).
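For concreteness, here is a quick back-of-the-envelope check of the sampling odds in Olivier's 99-numeric / 1-categorical example; the script below just restates his arithmetic, it is not part of scikit-learn:

```python
# Feature-sampling odds for the example above: 99 numeric features plus
# 1 categorical feature of cardinality 1000.
n_numeric, cardinality = 99, 1000

# Integer coding: the categorical column counts as a single feature.
n_features_int = n_numeric + 1
print(f"integer coding, per feature:     p = {1 / n_features_int:.2%}")

# One-hot encoding: the categorical column becomes `cardinality` binary columns.
n_features_ohe = n_numeric + cardinality
print(f"one-hot, per numeric feature:    p = {1 / n_features_ohe:.2%}")
print(f"one-hot, any categorical column: p = {cardinality / n_features_ohe:.2%}")
# -> ~1% per feature with integer coding, vs ~0.09% per numeric feature
#    and ~91% for the categorical block with one-hot encoding.
```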
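And a minimal sketch of the "use cross-validation to decide" advice, assuming scikit-learn 0.24+ (for ColumnTransformer and OrdinalEncoder's handle_unknown option); the synthetic dataset and column layout are illustrative assumptions, not anything from the thread:

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

rng = np.random.RandomState(0)

# Synthetic heterogeneous data: a few numeric columns plus one
# higher-cardinality categorical column (stored as integer codes).
n_samples, n_numeric, cardinality = 1000, 5, 50
X_num = rng.randn(n_samples, n_numeric)
X_cat = rng.randint(0, cardinality, size=(n_samples, 1))
X = np.hstack([X_num, X_cat])
y = (X_num[:, 0] + (X_cat[:, 0] % 3 == 0) > 0.5).astype(int)

cat_cols = [n_numeric]  # index of the categorical column

encoders = [
    ("integer", OrdinalEncoder(handle_unknown="use_encoded_value",
                               unknown_value=-1)),
    ("one-hot", OneHotEncoder(handle_unknown="ignore")),
]

# Cross-validate the same forest under each encoding and compare.
for name, encoder in encoders:
    pre = ColumnTransformer([("cat", encoder, cat_cols)],
                            remainder="passthrough")
    model = Pipeline([("pre", pre),
                      ("rf", RandomForestClassifier(random_state=0))])
    scores = cross_val_score(model, X, y, cv=5)
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Whichever encoding wins on your own data is the one to keep; on wide, high-cardinality problems the integer variant will usually also be noticeably cheaper in memory and fit time, as Olivier notes.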