Hm, I have to think about this more. But another case where I think that the
handling of categorical features could be useful is in non-binary trees; not
necessarily while learning, but in making predictions more efficiently. E.g.,
assume 3 classes that are perfectly separable by a "color" attribute (a small
toy sketch follows the two diagrams):
          color
         /  |  \
      red green blue

vs.

      red
      /  \
         green
         /  \
            blue
            /  \
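
Just to make that concrete, here is a minimal toy sketch (plain Python, not
scikit-learn code; the function names are made up) of what prediction looks
like in the two trees above:

def predict_multiway(color):
    # Native categorical split: one node, one lookup over all categories.
    return {"red": 0, "green": 1, "blue": 2}[color]

def predict_one_hot(is_red, is_green, is_blue):
    # One-hot encoding turns the same decision into a chain of binary
    # tests, so prediction has to walk up to k-1 nodes for k categories.
    if is_red:
        return 0
    elif is_green:
        return 1
    else:
        return 2

print(predict_multiway("blue"))   # -> 2
print(predict_one_hot(0, 0, 1))   # -> 2
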
Also, I think one other problem with one-hot encoding is random forests. Let's
say you have a dataset with 5 features: 4 numerical features and 1 categorical
feature, and the categorical variable can take, say, 30 possible values. After
one-hot encoding you have 34 features, and since each split only considers a
random subset of them, the majority of candidate features a tree gets to see
are just the different "flavors" of that one categorical variable -- you would
basically build a random forest that effectively only "considers" one of the
variables in the training set, if I am not missing anything here (a rough
simulation follows).
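
As a rough back-of-the-envelope check (my own toy simulation, assuming the
usual max_features=sqrt(n_features) sub-sampling of candidate features at
each split), the one-hot columns would dominate the candidates most of the
time:

import numpy as np

rng = np.random.RandomState(0)
n_numerical, n_dummies = 4, 30
n_features = n_numerical + n_dummies       # 34 features after one-hot encoding
max_features = int(np.sqrt(n_features))    # 5 candidate features per split

n_trials = 100000
dummy_counts = np.empty(n_trials)
for i in range(n_trials):
    # Draw the candidate features considered at one split.
    candidates = rng.choice(n_features, size=max_features, replace=False)
    dummy_counts[i] = np.sum(candidates >= n_numerical)

print("avg. fraction of candidates that are one-hot columns:",
      dummy_counts.mean() / max_features)     # ~0.88 (i.e., 30/34)
print("fraction of splits that see no numerical feature at all:",
      np.mean(dummy_counts == max_features))  # ~0.51
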
> On Nov 8, 2015, at 2:13 PM, Raphael C <[email protected]> wrote:
>
> On 8 November 2015 at 17:50, Sebastian Raschka <[email protected]> wrote:
>>
>>> On Nov 8, 2015, at 11:32 AM, Raphael C <[email protected]> wrote:
>>>
>>> In terms of computational efficiency, one-hot encoding combined with
>>> the support for sparse feature vectors seems to work well, at least
>>> for me. I assume therefore
>>> the problem must be in terms of classification accuracy.
>>
>> One thing comes to mind regarding the different solvers for the linear
>> models. E.g., Newton’s method is O(n * d^2), and even gradient descent is
>> O(n * d).
>>
>> For decision trees, I don’t see a substantial difference in terms of
>> computational complexity if a categorical feature, let’s say it can take 4
>> values, is split into 4 binary questions (i.e., using one-hot encoding). On
>> the other hand, I think the problem is that the decision algorithm does not
>> know that these 4 binary questions “belong” to one and the same feature,
>> which could
>> make the decision tree grow much larger in depth and width; this is bad for
>> computational efficiency and would more likely produce trees with higher
>> variance.
>>
>
> I am unclear what difference it makes for decision trees myself. I am
> no expert on the construction algorithms but I assume that they would
> never split on a feature which depends 100% on a parent node as one
> branch will just be empty. If that is right, it seems the decision
> tree should not grow much larger. It might take more time, I suppose,
> for the construction algorithm to work this out, of course.
>
> It would be great if anyone had a concrete example where it made a
> difference for a decision tree (or any classifier which uses decision
> trees).
>
> Raphael
>
------------------------------------------------------------------------------
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general