On 8 November 2015 at 20:42, Sebastian Raschka <se.rasc...@gmail.com> wrote:
> Hm, I have to think about this more. But another case where I think that the
> handling of categorical features could be useful is in non-binary trees; not
> necessarily while learning, but for making predictions more efficiently. E.g.,
> assuming 3 classes that are perfectly separable by a "color" attribute:
>
>        color
>       /  |  \
>    red green blue
>
> vs.
>
>      red
>     /   \
>        green
>        /   \
>           blue
>           /  \
>
> Also, I think one other problem with one-hot encoding is random forests.
> Let's say you have a dataset consisting of 5 features: 4 numerical features
> and 1 categorical feature. Now suppose your categorical variable has, say, 30
> possible values. After one-hot encoding you have 34 features, and the
> majority of the decision trees will only get to see the different "flavors" of
> the categorical variable -- you will basically build a random forest that
> effectively only "considers" one of the variables in the training set, if I
> am not missing anything here.
>
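A quick back-of-the-envelope calculation illustrates the effect, using the numbers from your example. (This sketch assumes the forest draws sqrt(n_features) candidate features at each split, a common random forest default; the variable names are mine.)

```python
from math import comb

# Numbers from the example above: 4 numeric features plus one
# categorical feature one-hot encoded into 30 dummy columns.
n_numeric = 4
n_dummies = 30
n_total = n_numeric + n_dummies            # 34 features after encoding

# Assumption: sqrt(n_features) candidate features per split,
# a typical random forest default for classification.
m = round(n_total ** 0.5)                  # 6 candidates per split

# Probability that a split's candidate set contains *no* numeric feature,
# i.e. the node can only split on flavours of the one categorical variable.
p_only_dummies = comb(n_dummies, m) / comb(n_total, m)
print(f"{p_only_dummies:.2f}")             # ≈ 0.44
```

So under those assumptions, roughly 44% of split candidate sets consist entirely of dummy columns from the single categorical feature, which backs up the intuition that it dominates the ensemble.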
Your second point is particularly strong. You are right that one-hot encoding could massively overemphasise the importance of categorical features with many categories under all sorts of regularisation schemes (including the per-split feature subsampling used by random forests). I look forward to https://github.com/scikit-learn/scikit-learn/pull/4899 now :)

Raphael

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general