On 8 November 2015 at 20:42, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Hm, I have to think about this more. But another case where I think that the
> handling of categorical features could be useful is in non-binary trees; not
> necessarily while learning, but for making predictions more efficiently. E.g.,
> assuming 3 classes that are perfectly separable by a "color" attribute:
>
>        color
>       /  |  \
>    red green blue
>
> vs.
>
>      red
>     /   \
>        green
>        /   \
>           blue
>           /  \
>
> Also, I think one other problem with one-hot encoding is random forests.
> Let's say you have a dataset consisting of 5 features: 4 numerical features
> and 1 categorical feature. Now suppose your categorical variable has, say, 30
> possible values. After one-hot encoding you have 34 features, and the
> majority of the decision trees will only get to see the different "flavors" of
> the categorical variable -- you will basically build a random forest that
> effectively only "considers" one of the variables in the training set, if I
> am not missing anything here.
>
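A quick back-of-the-envelope calculation illustrates the effect, using the numbers from your example. (This sketch assumes the forest draws sqrt(n_features) candidate features at each split, a common random forest default; the variable names are mine.)

```python
from math import comb

# Numbers from the example above: 4 numeric features plus one
# categorical feature one-hot encoded into 30 dummy columns.
n_numeric = 4
n_dummies = 30
n_total = n_numeric + n_dummies            # 34 features after encoding

# Assumption: sqrt(n_features) candidate features per split,
# a typical random forest default for classification.
m = round(n_total ** 0.5)                  # 6 candidates per split

# Probability that a split's candidate set contains *no* numeric feature,
# i.e. the node can only split on flavours of the one categorical variable.
p_only_dummies = comb(n_dummies, m) / comb(n_total, m)
print(f"{p_only_dummies:.2f}")             # ≈ 0.44
```

So under those assumptions, roughly 44% of split candidate sets consist entirely of dummy columns from the single categorical feature, which backs up the intuition that it dominates the ensemble.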
Your second point is particularly strong. You are right that one-hot encoding could massively overemphasise the importance of categorical features with many categories under all sorts of regularisation schemes (including the per-split feature subsampling used by random forests). I look forward to https://github.com/scikit-learn/scikit-learn/pull/4899 now :)

Raphael

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general