Integer coding will indeed make the DT assume an arbitrary ordering while one-hot encoding does not force the tree model to make that assumption.
However in practice when the depth of the trees is not too limited (or if you use a large enough ensemble of trees), the model will have enough flexibility to introduce as many splits as necessary to isolate individual categories in the integer and therefore the arbitrary ordering assumption is not a problem. On the other hand using one-hot encoding can introduce a detrimental inductive bias on random forests: random forest uses uniform random feature sampling when deciding which feature to split on (e.g. pick the best split out of 25% of the features selected at random). Let's consider the following example: assume you have an heterogeneously typed dataset with 99 numeric features and 1 categorical feature with categorical cardinality 1000 (1000 possible values for that features): - the RF will have one chance in 100 to pick each feature (categorical or numerical) as a candidate for the next split when using integer coding, - the RF will have 0.1% chance of picking each numerical feature and 99% chance to select a candidate feature split on a category of the unique categorical feature when using one-hot encoding. The inductive bias of one-encoding on RFs can therefore completely break the feature balancing. The feature encoding will also impact the inductive bias with respect the importance of the depth of the trees, even when feature splits are selected fully deterministically. Finally one-hot encoding features with large categorical cardinalities will be much slower then when using naive integer coding. TL;DNR: naive theoretical analysis based only on the ordering assumption can be misleading. Inductive biases of each feature encoding are more complex to evaluate. Use cross-validation to decide which is the best on your problem. Don't ignore computational considerations (CPU and memory usage). -- Olivier _______________________________________________ scikit-learn mailing list scikit-learn@python.org https://mail.python.org/mailman/listinfo/scikit-learn