Hm, I have to think about this more. But another case where I think the
handling of categorical features could be useful is non-binary trees; not
necessarily during learning, but for making predictions more efficiently.
E.g., assuming 3 classes that are perfectly separable by a "color" attribute:



                    color         
           /           |         \
         red      green     blue


vs.


                   red?
               /         \
           green?
           /     \
        blue?
        /    \
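
Just to make that picture concrete, here is a quick toy sketch (made-up data;
the column names in the comments are only for illustration) of what the
one-hot encoded version looks like to scikit-learn's binary trees, where a
hypothetical multi-way split on "color" would need only a single node:

    # Toy sketch (made-up data, not from this thread): 3 classes that are
    # perfectly separable by a single "color" attribute, one-hot encoded
    # into three indicator columns.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier

    color = np.array([0, 1, 2] * 50)   # 0 = red, 1 = green, 2 = blue
    y = color                          # the label is simply the color
    X_onehot = np.eye(3)[color]        # columns: is_red, is_green, is_blue

    tree = DecisionTreeClassifier(random_state=0).fit(X_onehot, y)
    print(tree.tree_.max_depth)    # 2 -- a chain of binary "is it X?" questions
    print(tree.tree_.node_count)   # 5 nodes, vs. a single 3-way split node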


Also, I think another problem with one-hot encoding shows up in random
forests. Let's say you have a dataset consisting of 5 features: 4 numerical
features and 1 categorical feature, and the categorical variable has, let's
say, 30 possible values. After one-hot encoding you have 34 features, and the
majority of the features an individual decision tree gets to see will just be
the different "flavors" of that one categorical variable -- so you basically
build a random forest that effectively "considers" only one of the variables
in the training set, if I am not missing anything here.
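
As a rough back-of-the-envelope sketch of that effect (using the hypothetical
4 + 30 numbers from above, and assuming per-split candidate sampling with a
max_features="sqrt"-style setting), the chance that a split never even gets a
numeric feature as a candidate is:

    # Hypothetical numbers from the example above: 4 numeric features plus
    # 30 one-hot columns from a single categorical variable.
    from math import comb, isqrt

    n_numeric, n_dummies = 4, 30
    n_total = n_numeric + n_dummies   # 34 columns after one-hot encoding
    m = isqrt(n_total)                # ~5 candidates per split (sqrt-style)

    # Probability that a split's random candidate set contains *only* dummy
    # columns, i.e. none of the 4 numeric features is even considered:
    p_only_dummies = comb(n_dummies, m) / comb(n_total, m)
    print(round(p_only_dummies, 2))   # ~0.51

So with these made-up numbers, roughly half of the splits only get to choose
among the dummy columns of the single categorical variable.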

> On Nov 8, 2015, at 2:13 PM, Raphael C <drr...@gmail.com> wrote:
> 
> On 8 November 2015 at 17:50, Sebastian Raschka <se.rasc...@gmail.com> wrote:
>> 
>>> On Nov 8, 2015, at 11:32 AM, Raphael C <drr...@gmail.com> wrote:
>>> 
>>> In terms of computational efficiency, one-hot encoding combined with
>>> the support for sparse feature vectors seems to work well, at least
>>> for me. I assume therefore
>>> the problem must be in terms of classification accuracy.
>> 
>> One thing comes to mind regarding the different solvers for the linear
>> models. E.g., Newton’s method is O(n * d^2), and even gradient descent is
>> O(n * d).
>> 
>> For decision trees, I don’t see a substantial difference in terms of
>> computational complexity if a categorical feature, let’s say one that can
>> take 4 values, is split into 4 binary questions (i.e., using one-hot
>> encoding). On the other hand, I think the problem is that the decision
>> algorithm does not know that these 4 binary questions “belong” to one
>> observation, which could make the decision tree grow much larger in depth
>> and width; this is bad for computational efficiency and would more likely
>> produce trees with higher variance.
>> 
> 
> I am unclear what difference it makes for decision trees myself. I am
> no expert on the construction algorithms, but I assume that they would
> never split on a feature which depends 100% on a parent node, as one
> branch will just be empty. If that is right, it seems the decision
> tree should not grow much larger. It might take more time, I suppose,
> for the construction algorithm to work this out, of course.
> 
> It would be great if anyone had a concrete example where it made a
> difference for a decision tree (or any classifier which uses decision
> trees).
> 
> Raphael
> 

