On 06/02/2013 10:53 PM, Christian Jauvin wrote:
> Hi Andreas,
>
>> Btw, you do encode the categorical variables using one-hot, right?
>> The sklearn trees don't really support categorical variables.
>
> I'm rather perplexed by this.. I assumed that sklearn's RF only
> required its input to be numerical, so I only used a LabelEncoder up
> to now.

Hum. I have not considered that. Peter? Gilles? Lars? Little help?
Sklearn does not implement any special treatment for categorical
variables. You can feed it any float; the question is whether it works
/ what it does. I guess you (and Kaggle) observed that it does work
somewhat; I'm not sure it does what you want. The splits will be the
same as for numerical variables, i.e. "> threshold". If the variables
have an ordering (and LabelEncoder respects that ordering), that makes
sense. If the variables don't have an ordering (which I would assume
is the more common case for categorical variables), I don't think that
makes much sense.

> My assumption was backed by two external sources of information:
>
> (1) The benchmark code provided by Kaggle in the SO contest (which was
> actually the first time I used RFs) didn't seem to perform such a
> transformation:
> https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

I don't see where categorical variables are used in this code. Could
you please point it out?

> (2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs:
> http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests

I am not that experienced with categorical variables. The catch here
seems to be "not too many values". Maybe it works for "few" values, but
it is not what I would expect a random forest implementation to do on
categorical variables. I think it is rather bad that the tutorial
doesn't mention one-hot encoding if it is using sklearn.

It would be fairly easy to implement the usual categorical splits, but
they are not in sklearn, as there is no obvious way to declare a column
a categorical variable (you would need an auxiliary array, and no one
has done this yet).

> Moreover, I just tested it with my own experiment, and I found that a
> RF trained on a (21080 x 4) input matrix (i.e. 4 categorical
> variables, non-one-hot encoded) performs the same (to the third
> decimal in accuracy and AUC, with 10-fold CV) as with its equivalent,
> one-hot encoded (21080 x 1347) matrix.

Not sure what this says about your dataset / features. If the
variables don't have any ordering and the splits take arbitrary
subsets, that would seem a bit weird to me.

> Sorry if the confusion is on my side, but did I miss something?

Maybe I'm just not well-versed enough in the use of numerically
encoded categorical variables in random forests.

Cheers,
Andy
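P.S.: For concreteness, here is a minimal sketch of the difference, on
made-up toy data (none of this is from your actual problem):

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# One categorical feature with no meaningful ordering.
colors = np.array(["red", "green", "blue", "green", "red", "blue"])
y = np.array([0, 1, 1, 1, 0, 1])

# LabelEncoder maps categories to arbitrary integers
# (blue=0, green=1, red=2), so a tree split like "x <= 1.5" groups
# blue and green against red, an ordering that means nothing here.
x_int = LabelEncoder().fit_transform(colors).reshape(-1, 1)

# One-hot encoding gives one binary column per category, so every
# split asks "is it this category or not?".
x_hot = OneHotEncoder().fit_transform(x_int).toarray()

for name, X in (("label-encoded", x_int), ("one-hot", x_hot)):
    rf = RandomForestClassifier(n_estimators=10, random_state=0)
    print(name, rf.fit(X, y).score(X, y))

With the one-hot columns, the forest can isolate any single category
with a single split; with the integer encoding, it can only carve out
contiguous ranges of whatever order LabelEncoder happened to assign.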