On 06/02/2013 10:53 PM, Christian Jauvin wrote:
> Hi Andreas,
>
>> Btw, you do encode the categorical variables using one-hot, right?
>> The sklearn trees don't really support categorical variables.
> I'm rather perplexed by this.. I assumed that sklearn's RF only
> required its input to be numerical, so I only used a LabelEncoder up
> to now.
Hum. I have not considered that. Peter? Gilles? Lars? Little help?

Sklearn does not implement any special treatment for categorical variables.
You can feed it any float; the question is whether that works, and what it actually does.

I guess you (and Kaggle) observed that it does work somewhat; I'm not sure
it does what you want. The splits will be the same as for numerical
variables, i.e. > threshold. If the variables have an ordering (and
LabelEncoder respects that ordering), that makes sense. If the variables
don't have an ordering (which I would assume is the more common case for
categorical variables), I don't think that makes much sense.


> My assumption was backed by two external sources of information:
>
> (1) The benchmark code provided by Kaggle in the SO contest (which was
> actually the first time I used RFs) didn't seem to perform such a
> transformation:
>
> https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

I don't see where categorical variables are used in this code. Could you 
please point it out?

>
> (2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs:
>
> http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests
I am not that experienced with categorical variables. The catch here
seems to be "not too many values". Maybe it works for "few" values, but
it is not what I would expect a random forest implementation to do on
categorical variables.

I think it is rather bad that the tutorial doesn't mention one-hot
encoding if it is using sklearn. It would be fairly straightforward to
implement the usual categorical split tests, but they are not implemented
in sklearn, as there is no obvious way to declare a column a categorical
variable (you would need an auxiliary array, and no one has done this yet).
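For reference, a minimal plain-Python sketch of what one-hot encoding a
single column looks like (in practice you would use sklearn's
OneHotEncoder; the helper name here is my own):

```python
# Minimal one-hot encoding sketch: each distinct value gets its own
# 0/1 indicator column, so no artificial ordering is imposed.
def one_hot(values):
    """Return one row of 0/1 indicators per input value."""
    categories = sorted(set(values))  # fixed column order: sorted categories
    return [[1 if v == c else 0 for c in categories] for v in values]

rows = one_hot(['red', 'green', 'red', 'blue'])
# columns are blue, green, red, so:
# rows -> [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```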

> Moreover, I just tested it with my own experiment, and I found that a
> RF trained on a (21080 x 4) input matrix (i.e. 4 categorical
> variables, non-one-hot encoded) performs the same (to the third
> decimal in accuracy and AUC, with 10-fold CV) as with its equivalent,
> one-hot encoded (21080 x 1347) matrix.
Not sure what this says about your dataset / features. If the variables
don't have any ordering, so that the splits cut out arbitrary subsets,
getting identical performance would seem a bit odd to me.
>
> Sorry if the confusion is on my side, but did I miss something?
Maybe I'm just not well-versed enough in the use of numerically encoded 
categorical variables in random forests.

Cheers,
Andy

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general