On Mon, Jun 3, 2013 at 12:41 PM, Christian Jauvin <cjau...@gmail.com> wrote:


> > Sklearn does not implement any special treatment for categorical
> > variables. You can feed any float. The question is if it would work /
> > what it does.
>
> I think I'm confused about a couple of aspects (that's what happens I
> guess when you play with algorithms for which you don't have a
> complete and firm understanding beforehand!). I assumed that
> sklearn-RF's requirement for numerical inputs was just a data
> representation/implementation aspect, and that once properly
> transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> hood, whether a predictor was categorical or numerical.
>
> Now if I understand you well, sklearn shouldn't be able to explicitly
> handle the categorical case where no order exists (i.e. categorical,
> as opposed to ordinal).

It comes down to what sort of decision can be made at each node.
scikit-learn always uses decisions of the form (x > t) for some feature
value x and some threshold t.

Let's make this more concrete: suppose you have a feature with possible
values {A, B, C, D}. Ideal categorical treatment would partition the set
of categories so that each side of the partition corresponds to a
different child in the decision tree. The possible decisions would then
distinguish {A} from {B, C, D}; {B} from {A, C, D}; {C} from {A, B, D};
{D} from {A, B, C}; {A, B} from {C, D}; {A, C} from {B, D}; and {A, D}
from {B, C}. scikit-learn can't make these sorts of splits.

LabelEncoder will turn these values into [0, 1, 2, 3], after which only
splits respecting that ordering are possible: a single split can
distinguish {A} from {B, C, D}; {A, B} from {C, D}; or {A, B, C} from
{D}.

LabelBinarizer will instead allow a single split to distinguish any one
category from all the others: {A} from {B, C, D}; {B} from {A, C, D};
{C} from {A, B, D}; or {D} from {A, B, C}.
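To make the two encodings concrete, here is a minimal plain-Python sketch of what they produce for the {A, B, C, D} example (it mimics the output of sklearn's LabelEncoder and LabelBinarizer rather than calling them, so it runs with no dependencies):

```python
categories = ["A", "B", "C", "D"]

# LabelEncoder-style: map each category to an integer index.
# A threshold split (x > t) can then only separate a prefix of the
# ordering from a suffix, e.g. {A, B} from {C, D}.
label_encoded = {c: i for i, c in enumerate(sorted(categories))}
print(label_encoded)  # {'A': 0, 'B': 1, 'C': 2, 'D': 3}

# LabelBinarizer-style: one 0/1 indicator column per category.
# A threshold split on any single column separates exactly one
# category from all the others, e.g. {B} from {A, C, D}.
binarized = {c: [int(c == other) for other in sorted(categories)]
             for c in categories}
print(binarized["B"])  # [0, 1, 0, 0]
```

With the binarized representation, a split on column 1 at t = 0.5 is exactly the decision "{B} vs. {A, C, D}" described above.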
Note that all these encodings yield the same hypothesis space; it may
just require a deeper tree to represent the same thing (and the learning
process can't take advantage of similar categories). However, in the
last two cases the number of possible splits at a single node is linear
in the number of categories, whereas selecting an arbitrary partition
allows exponentially many splits with respect to the number of
categories (though there may be approximations that avoid evaluating all
possible splits; I'm not familiar with the literature). So it should be
quite clear that binarized categories allow the most meaningful
decisions at the least complexity.

Cheers,

- Joel
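P.S. The counting argument above can be sketched in a few lines. For k categories, thresholds over a label-encoded column give k - 1 candidate splits, one-vs-rest binarized columns give k, and arbitrary subset-vs-complement partitions give 2**(k - 1) - 1. The enumeration below (illustrative, not sklearn API) verifies the last count for {A, B, C, D} by fixing 'A' on one side so each unordered partition is counted once:

```python
from itertools import combinations

# Arbitrary partitions of {A, B, C, D}: each candidate split is a
# proper, nonempty subset vs. its complement. Fixing 'A' in the subset
# counts each unordered {subset, complement} pair exactly once.
splits = [frozenset({"A", *rest})
          for r in range(3)                  # subset sizes 1, 2, 3
          for rest in combinations("BCD", r)]
print(len(splits))  # 7, i.e. 2**(4 - 1) - 1

# Compare with the linear-sized alternatives for k = 4:
k = 4
print(k - 1)  # LabelEncoder + threshold: 3 order-respecting cuts
print(k)      # LabelBinarizer: 4 one-vs-rest splits
```

For k = 20 the gap is 19 or 20 candidate splits versus 524,287 partitions, which is why exact arbitrary-partition splitting gets expensive quickly.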


_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general