Hi Andreas,

> Btw, you do encode the categorical variables using one-hot, right?
> The sklearn trees don't really support categorical variables.
I'm rather perplexed by this... I assumed that sklearn's RF only required its input to be numerical, so I have only used a LabelEncoder up to now. My assumption was backed by two external sources of information:

(1) The benchmark code provided by Kaggle in the SO contest (which was actually the first time I used RFs) doesn't seem to perform such a transformation: https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

(2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs: http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests

Moreover, I just tested it in my own experiment, and I found that an RF trained on a (21080 x 4) input matrix (i.e. 4 categorical variables, label-encoded rather than one-hot encoded) performs the same (to the third decimal in accuracy and AUC, with 10-fold CV) as on its equivalent, one-hot encoded (21080 x 1347) matrix.

Sorry if the confusion is on my side, but did I miss something?

Christian

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
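P.S. For anyone wanting to reproduce the kind of comparison described above, here is a rough sketch. It uses synthetic random categorical data (not the actual 21080 x 4 dataset from my experiment) and a current scikit-learn API, so the exact numbers won't match; it only illustrates the label-encoded vs. one-hot-encoded setup:

```python
# Sketch: compare a label-encoded vs. a one-hot-encoded categorical
# matrix fed to a RandomForestClassifier, with cross-validated accuracy.
# The data here is synthetic and random, purely for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples, n_features = 500, 4

# 4 categorical columns, each with up to 10 string-valued categories.
X_cat = rng.randint(0, 10, size=(n_samples, n_features)).astype(str)
y = rng.randint(0, 2, size=n_samples)

# (1) Label encoding: each column's categories become arbitrary integers,
# so the trees see a single ordered feature per categorical variable.
X_label = np.column_stack(
    [LabelEncoder().fit_transform(X_cat[:, j]) for j in range(n_features)]
)

# (2) One-hot encoding: one binary indicator column per category.
# .toarray() densifies the sparse output so both inputs are comparable.
X_onehot = OneHotEncoder().fit_transform(X_cat).toarray()

rf = RandomForestClassifier(n_estimators=100, random_state=0)
acc_label = cross_val_score(rf, X_label, y, cv=10).mean()
acc_onehot = cross_val_score(rf, X_onehot, y, cv=10).mean()
print("label-encoded accuracy:", acc_label)
print("one-hot accuracy:", acc_onehot)
```

With random labels both scores hover around chance level; on real data the gap (or absence of one) between the two encodings is what the experiment above measured.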