Hm, I have to think about this more. But another case where I think native handling of categorical features could be useful is non-binary trees; not necessarily during learning, but for making predictions more efficiently. E.g., assuming 3 classes that are perfectly separable by a "color" attribute:
      color                     red
     /  |  \                   /   \
  red green blue     vs.           green
                                   /    \
                                        blue
                                        /  \

(There is a small scikit-learn sketch of this at the end of this mail.)

Also, I think one other problem with one-hot encoding is random forests. Let's say you have a dataset consisting of 5 features: 4 numerical features and 1 categorical feature, and the categorical variable has, let's say, 30 possible values. After one-hot encoding you have 34 features, and the majority of the decision trees will only get to see the different "flavors" of the categorical variable -- you would basically build a random forest that effectively only "considers" one of the variables in the training set, if I am not missing anything here (rough numbers in the second sketch at the end of this mail).

> On Nov 8, 2015, at 2:13 PM, Raphael C <drr...@gmail.com> wrote:
>
> On 8 November 2015 at 17:50, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>>
>>> On Nov 8, 2015, at 11:32 AM, Raphael C <drr...@gmail.com> wrote:
>>>
>>> In terms of computational efficiency, one-hot encoding combined with
>>> the support for sparse feature vectors seems to work well, at least
>>> for me. I assume therefore the problem must be in terms of
>>> classification accuracy.
>>
>> One thing comes to mind regarding the different solvers for the linear
>> models. E.g., Newton’s method is O(n * d^2), and even gradient descent
>> is O(n * d).
>>
>> For decision trees, I don’t see a substantial difference in terms of
>> computational complexity if a categorical feature, let’s say one that
>> can take 4 values, is split into 4 binary questions (i.e., using
>> one-hot encoding). On the other hand, I think the problem is that the
>> decision algorithm does not know that these 4 binary questions “belong”
>> to one observation, which could make the decision tree grow much larger
>> in depth and width; this is bad for computational efficiency and would
>> more likely produce trees with higher variance.
>>
> I am unclear what difference it makes for decision trees myself. I am
> no expert on the construction algorithms, but I assume that they would
> never split on a feature which depends 100% on a parent node, as one
> branch would just be empty. If that is right, it seems the decision
> tree should not grow much larger. It might take more time, I suppose,
> for the construction algorithm to work this out.
>
> It would be great if anyone had a concrete example where it made a
> difference for a decision tree (or any classifier which uses decision
> trees).
>
> Raphael
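P.S. Here is a quick sketch of the "color" example above (made-up data, using scikit-learn's DecisionTreeClassifier; just an illustration, not a benchmark). With one-hot encoding, the perfectly separating 3-valued attribute needs a chain of two binary splits where a single 3-way split would do:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Made-up toy data: the class label is fully determined by a single
# 3-valued "color" attribute (0=red, 1=green, 2=blue).
rng = np.random.RandomState(0)
color = rng.randint(0, 3, size=300)
y = color

# One-hot encode the single categorical column into 3 dummy columns.
X_onehot = np.eye(3)[color]

tree = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)
# A (hypothetical) multi-way split would be a single node; the binary
# tree over the dummy columns needs two levels, e.g. "red?" then "green?".
print(tree.tree_.max_depth)  # -> 2

For prediction, that is two node tests per sample instead of one multi-way lookup on the color value, which is the efficiency point I was trying to make above.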
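And a rough back-of-the-envelope number for the random forest point, using the 4 numerical + 30-level categorical example from above (strictly speaking, scikit-learn subsamples features at each split rather than per tree, but the effect is the same in spirit):

from math import comb  # Python >= 3.8

n_numeric = 4
n_dummies = 30                    # dummy columns from the single categorical feature
n_total = n_numeric + n_dummies   # 34 features after one-hot encoding
m = 5                             # int(sqrt(34)), roughly the default max_features
                                  # for classification

# Probability that a randomly drawn candidate set of m features
# contains no numerical feature at all:
p_no_numeric = comb(n_dummies, m) / comb(n_total, m)
print(round(p_no_numeric, 2))     # -> 0.51, i.e. about half of the splits only get
                                  #    to choose among "flavors" of the categorical one

So even before any impurity criterion comes into play, about half of the splits never see a numerical feature as a candidate at all.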