Hi Andreas,

> Btw, you do encode the categorical variables using one-hot, right?
> The sklearn trees don't really support categorical variables.

I'm rather perplexed by this... I assumed that sklearn's RF only
required its input to be numerical, so I have only used a LabelEncoder
up to now.

My assumption was backed by two external sources of information:

(1) The benchmark code provided by Kaggle in the SO contest (which was
actually the first time I used RFs) didn't seem to perform such a
transformation:

https://github.com/benhamner/Stack-Overflow-Competition/blob/master/features.py

(2) It doesn't seem to be mentioned in this Kaggle tutorial about RFs:

http://www.kaggle.com/c/titanic-gettingStarted/details/getting-started-with-random-forests

Moreover, I just tested it in my own experiment, and I found that an
RF trained on a (21080 x 4) input matrix (i.e. 4 categorical
variables, label-encoded rather than one-hot encoded) performs the
same (to the third decimal in accuracy and AUC, with 10-fold CV) as
one trained on the equivalent one-hot encoded (21080 x 1347) matrix.

Sorry if the confusion is on my side, but did I miss something?

Christian

_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general
