On 06/03/2013 04:41 AM, Christian Jauvin wrote:
>> Sklearn does not implement any special treatment for categorical variables.
>> You can feed any float. The question is if it would work / what it does.
>
> I think I'm confused about a couple of aspects (that's what happens I
> guess when you play with algorithms for which you don't have a
> complete and firm understanding beforehand!). I assumed that
> sklearn-RF's requirement for numerical inputs was just a data
> representation/implementation aspect, and that once properly
> transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> hood, whether a predictor was categorical or numerical.
>
> Now if I understand you well, sklearn shouldn't be able to explicitly
> handle the categorical case where no order exists (i.e. categorical,
> as opposed to ordinal).

Yes. At least the splitting criterion is not the one usually used for
categorical variables.

> But you seem to also imply that sklearn can indirectly support it
> using dummy variables..

Yes.

> Bigger question: given that Decision Trees (in general) support pure
> categorical variables.. shouldn't Random Forests also do?

As I said, trees in sklearn don't. But that is a purely
implementation / API problem.
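To make the dummy-variable route concrete, here is a minimal sketch (toy
data and category values invented purely for illustration) of the two
encodings being discussed: LabelEncoder imposes an arbitrary integer order
on the categories, while dummy / one-hot encoding gives each category its
own binary column, so the trees only ever split on "is this category or not".

    # Hypothetical toy example; the 'colors' feature and labels are made up.
    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier

    colors = np.array(['red', 'green', 'blue', 'green', 'red', 'blue'])
    y = np.array([0, 1, 1, 1, 0, 1])

    # Ordinal-style encoding: 'blue' -> 0, 'green' -> 1, 'red' -> 2
    # (alphabetical, hence an arbitrary ordering).
    le = LabelEncoder()
    X_ordinal = le.fit_transform(colors).reshape(-1, 1)

    # Dummy-variable encoding: one binary column per category.
    ohe = OneHotEncoder()
    X_dummies = ohe.fit_transform(X_ordinal).toarray()

    rf = RandomForestClassifier(n_estimators=10, random_state=0)
    rf.fit(X_dummies, y)   # splits are now "category present or not"

If you start from a pandas DataFrame, pandas.get_dummies does the same
one-hot expansion in a single step.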
>> Not sure what this says about your dataset / features.
>> If the variables don't have any ordering and the splits take arbitrary
>> subsets, that would seem a bit weird to me.
>
> In fact that's really what I observe: apart from the first of my 4
> variables, which is a year, the remaining 3 are purely categorical,
> with no implicit order. So that result is weird because it is not in
> line with what you've been saying.

Actually, I think all classifiers can also be represented by treating the
categorical features as ordinal ones; it is just that the tree needs to be
deeper and the splits are a bit weird. Imagine you want to get category 'c'
out of 'a', 'b', 'c', 'd', 'e': you have to threshold between 'b' and 'c'
and then between 'c' and 'd', so you get three branches ('a', 'b'), ('c'),
('d', 'e') (see the small sketch at the end of this mail). If there is no
ordering to the variables, that is really weird. If you have enough data,
it might not make a difference, though. If your trees are not too deep
(and there are not too many of them), you can dump them using dot and
inspect the splits.

I don't have time to look at the documentation now, but maybe we should
clear it up a bit. Also, maybe we should tell the Kaggle folks to add a
sentence to their tutorial.

Cheers,
Andy
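P.S.: A small sketch of the point above, with toy data invented for
illustration. Categories 'a'..'e' are label-encoded as 0..4 and only 'c'
is in the positive class; the fitted tree then needs two thresholds
(1.5 and 2.5) to isolate 'c', which is exactly the ('a', 'b'), ('c'),
('d', 'e') branching described above. export_graphviz writes the dot file
you can render afterwards.

    # Toy reproduction of the "two thresholds to isolate one category" split.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    # 'a'..'e' encoded as 0..4; only category 'c' (encoded 2) is positive.
    X = np.array([[0], [1], [2], [3], [4]] * 20)
    y = (X.ravel() == 2).astype(int)

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Dump the tree; render with: dot -Tpng tree.dot -o tree.png
    export_graphviz(tree, out_file='tree.dot')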