On 06/03/2013 04:41 AM, Christian Jauvin wrote:
>> Sklearn does not implement any special treatment for categorical variables.
>> You can feed any float. The question is if it would work / what it does.
>
> I think I'm confused about a couple of aspects (that's what happens I
> guess when you play with algorithms for which you don't have a
> complete and firm understanding beforehand!). I assumed that
> sklearn-RF's requirement for numerical inputs was just a data
> representation/implementation aspect, and that once properly
> transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> hood, whether a predictor was categorical or numerical.
>
> Now if I understand you well, sklearn shouldn't be able to explicitly
> handle the categorical case where no order exists (i.e. categorical,
> as opposed to ordinal).

Yes. At least the splitting criterion is not the one usually used for
categorical variables.

> But you seem to also imply that sklearn can indirectly support it
> using dummy variables..

Yes.

> Bigger question: given that Decision Trees (in general) support pure
> categorical variables.. shouldn't Random Forests also do?

As I said, trees in sklearn don't. But that is a purely
implementation / API problem.
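To make the dummy-variable route concrete, here is a minimal sketch (toy
data and category values invented purely for illustration) of the two
encodings being discussed: LabelEncoder imposes an arbitrary integer order
on the categories, while dummy / one-hot encoding gives each category its
own binary column, so the trees only ever split on "is this category or not".

    # Hypothetical toy example; the 'colors' feature and labels are made up.
    import numpy as np
    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    from sklearn.ensemble import RandomForestClassifier

    colors = np.array(['red', 'green', 'blue', 'green', 'red', 'blue'])
    y = np.array([0, 1, 1, 1, 0, 1])

    # Ordinal-style encoding: 'blue' -> 0, 'green' -> 1, 'red' -> 2
    # (alphabetical, hence an arbitrary ordering).
    le = LabelEncoder()
    X_ordinal = le.fit_transform(colors).reshape(-1, 1)

    # Dummy-variable encoding: one binary column per category.
    ohe = OneHotEncoder()
    X_dummies = ohe.fit_transform(X_ordinal).toarray()

    rf = RandomForestClassifier(n_estimators=10, random_state=0)
    rf.fit(X_dummies, y)   # splits are now "category present or not"

If you start from a pandas DataFrame, pandas.get_dummies does the same
one-hot expansion in a single step.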
>> Not sure what this says about your dataset / features.
>> If the variables don't have any ordering and the splits take arbitrary
>> subsets, that would seem a bit weird to me.
>
> In fact that's really what I observe: apart from the first of my 4
> variables, which is a year, the remaining 3 are purely categorical,
> with no implicit order. So that result is weird because it is not in
> line with what you've been saying.

Actually, I think all classifiers can also be represented by treating the
categorical features as ordinal ones; it is just that the tree needs to be
deeper and the splits are a bit weird. Imagine you want to get category 'c'
out of 'a', 'b', 'c', 'd', 'e': you have to threshold between 'b' and 'c'
and then between 'c' and 'd', so you get three branches ('a', 'b'), ('c'),
('d', 'e') (see the small sketch at the end of this mail). If there is no
ordering to the variables, that is really weird. If you have enough data,
it might not make a difference, though. If your trees are not too deep
(and there are not too many of them), you can dump them using dot and
inspect the splits.

I don't have time to look at the documentation now, but maybe we should
clear it up a bit. Also, maybe we should tell the Kaggle folks to add a
sentence to their tutorial.

Cheers,
Andy
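P.S.: A small sketch of the point above, with toy data invented for
illustration. Categories 'a'..'e' are label-encoded as 0..4 and only 'c'
is in the positive class; the fitted tree then needs two thresholds
(1.5 and 2.5) to isolate 'c', which is exactly the ('a', 'b'), ('c'),
('d', 'e') branching described above. export_graphviz writes the dot file
you can render afterwards.

    # Toy reproduction of the "two thresholds to isolate one category" split.
    import numpy as np
    from sklearn.tree import DecisionTreeClassifier, export_graphviz

    # 'a'..'e' encoded as 0..4; only category 'c' (encoded 2) is positive.
    X = np.array([[0], [1], [2], [3], [4]] * 20)
    y = (X.ravel() == 2).astype(int)

    tree = DecisionTreeClassifier(random_state=0).fit(X, y)

    # Dump the tree; render with: dot -Tpng tree.dot -o tree.png
    export_graphviz(tree, out_file='tree.dot')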