Our decision tree implementation only supports numerical splits, i.e. it
tests val < threshold .

Categorical features need to be encoded properly. I recommend one-hot
encoding for features with small cardinality (e.g. < 50) and ordinal
encoding (simply assign each category an integer value) for features with
large cardinality. Sufficiently deep decision trees will handle ordinal
encoded categorical features nicely - the same holds for boosting models
with a sufficient number of trees (see [1]).
Furthermore, ordinal encoding can be more efficient than one-hot encoding
since fewer features need to be searched at each split. One-hot encoding,
on the other hand, plays much more nicely with model interpretation.
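
For illustration, here's a minimal sketch of both encodings using
scikit-learn's preprocessing module (LabelEncoder is strictly meant for
target labels, but it's the common workaround for integer-coding a feature
column; the data here is made up):

    from sklearn.preprocessing import LabelEncoder, OneHotEncoder
    import numpy as np

    colors = np.array(['red', 'green', 'blue', 'green', 'red'])

    # ordinal encoding: one integer column per categorical feature
    # (LabelEncoder assigns codes alphabetically: blue=0, green=1, red=2)
    ordinal = LabelEncoder().fit_transform(colors)   # -> [2 1 0 1 2]

    # one-hot encoding: one binary column per category
    onehot = OneHotEncoder().fit_transform(ordinal.reshape(-1, 1)).toarray()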

Regarding split tests for categorical variables: there are two types of
tests I'm aware of: the equality test (val = cat) and the subset test (val
in {cat-subset}). While the latter sounds more powerful, it should be
considered harmful: subset tests give rise to 2^(K-1) - 1 potential
splitting points per categorical feature, whereas numerical features only
have N - 1 potential split points (where N is the number of examples and K
is the cardinality of the categorical feature). A large number of potential
split points can lead to severe overfitting (you will most likely find a
subset that perfectly separates your training data). AFAIK R's randomForest
package only supports subset tests, so it might in fact be advantageous to
use ordinal encoding there too when your features have large cardinality.
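
To get a feeling for the difference, a quick back-of-the-envelope
computation (the subset test considers every non-trivial bipartition of the
K categories, while an ordinal encoding admits at most K - 1 thresholds per
feature):

    for K in (3, 10, 20):
        print("K=%2d: %6d subset splits vs. %2d thresholds"
              % (K, 2 ** (K - 1) - 1, K - 1))
    # K=20: 524287 subset splits vs. 19 thresholds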

HTH,
 peter

[1] http://www.salford-systems.com/en/blog/dan-steinberg/item/15-modeling-tricks-with-treenet-treating-categorical-variables-as-continuous

PS: regarding the Kaggle tutorial - they were most likely not aware of this

2013/6/3 Andreas Mueller <amuel...@ais.uni-bonn.de>

> On 06/03/2013 04:41 AM, Christian Jauvin wrote:
> >> Sklearn does not implement any special treatment for categorical
> >> variables. You can feed any float. The question is whether it would
> >> work / what it does.
> > I think I'm confused about a couple of aspects (that's what happens I
> > guess when you play with algorithms for which you don't have a
> > complete and firm understanding beforehand!). I assumed that
> > sklearn-RF's requirement for numerical inputs was just a data
> > representation/implementation aspect, and that once properly
> > transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> > hood, whether a predictor was categorical or numerical.
> >
> > Now if I understand you well, sklearn shouldn't be able to explicitly
> > handle the categorical case where no order exists (i.e. categorical,
> > as opposed to ordinal).
> Yes. At least the splitting criterion is not the one usually used.
> >
> > But you seem to also imply that sklearn can indirectly support it
> > using dummy variables...
> Yes.
> >
> > Bigger question: given that decision trees (in general) support purely
> > categorical variables, shouldn't Random Forests too?
> >
> As I said, trees in sklearn don't. But that is a purely implementation /
> API problem.
>
> >
> >> Not sure what this says about your dataset / features.
> >> If the variables don't have any ordering and the splits take arbitrary
> >> subsets, that would seem a bit weird to me.
> > In fact that's really what I observe: apart from the first of my 4
> > variables, which is a year, the remaining 3 are purely categorical,
> > with no implicit order. So that result is weird because it is not in
> > line with what you've been saying.
> Actually I think all such classifiers can also be represented by treating
> the categorical features as ordinal ones; it is just that the tree needs
> to be deeper and the splits are a bit weird. Imagine you want to get
> category c out of a, b, c, d, e: you have to threshold between b and c
> and then between c and d, so you get three branches ('a', 'b'), ('c'),
> ('d', 'e'). If there is no ordering to the variables, that is really
> weird.
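>
> A minimal sketch of that example, assuming ordinal codes a=0 ... e=4:
>
>     from sklearn.tree import DecisionTreeClassifier
>     import numpy as np
>
>     # ordinal codes a=0, b=1, c=2, d=3, e=4; target: is it category 'c'?
>     X = np.array([[0], [1], [2], [3], [4]])
>     y = np.array([0, 0, 1, 0, 0])
>     tree = DecisionTreeClassifier().fit(X, y)
>     # the fitted tree needs two threshold tests (around 1.5 and 2.5) to
>     # carve out 'c', giving the ('a','b'), ('c'), ('d','e') branches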
> If you have enough data, it might not make a difference, though - if your
> trees are not too deep (and there are not too many of them) you can dump
> them using dot and inspect the splits.
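>
> For example (a self-contained sketch; export_graphviz writes a Graphviz
> .dot file that dot can render):
>
>     from sklearn.tree import DecisionTreeClassifier, export_graphviz
>     import numpy as np
>
>     tree = DecisionTreeClassifier().fit(np.array([[0], [1], [2], [3], [4]]),
>                                         np.array([0, 0, 1, 0, 0]))
>     export_graphviz(tree, out_file='tree.dot')
>     # render with: dot -Tpng tree.dot -o tree.png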
>
> I don't have time to look at the documentation now, but maybe we should
> clear it up a bit.
> Also, maybe we should tell the kaggle folks to add a sentence to their
> tutorial.
>
> Cheers,
> Andy
>
>



-- 
Peter Prettenhofer
