On Mon, Jun 3, 2013 at 12:41 PM, Christian Jauvin <cjau...@gmail.com> wrote:

> > Sklearn does not implement any special treatment for categorical
> > variables.
> > You can feed any float. The question is if it would work / what it does.
> I think I'm confused about a couple of aspects (that's what happens I
> guess when you play with algorithms for which you don't have a
> complete and firm understanding beforehand!). I assumed that
> sklearn-RF's requirement for numerical inputs was just a data
> representation/implementation aspect, and that once properly
> transformed (i.e. using a LabelEncoder), it wouldn't matter, under the
> hood, whether a predictor was categorical or numerical.
> Now if I understand you well, sklearn shouldn't be able to explicitly
> handle the categorical case where no order exists (i.e. categorical,
> as opposed to ordinal).

It comes down to what sort of decision can be made at each node.
scikit-learn always uses decisions of the form (x > t) for some feature
value x and some threshold t.

Let's make this more concrete: you have a feature with possible values {A,
B, C, D}.

Ideal categorical treatment would partition the set of categories so that
each side of the partition corresponds to a different child in the decision
tree. For four categories there are 2^3 - 1 = 7 such binary splits: {A} vs
{B, C, D}; {B} vs {A, C, D}; {C} vs {A, B, D}; {D} vs {A, B, C}; {A, B} vs
{C, D}; {A, C} vs {B, D}; {A, D} vs {B, C}. Scikit-learn can't make these
sorts of splits.

LabelEncoder will turn these into [0, 1, 2, 3]. Then only splits respecting
the ordering are possible. So a single split can distinguish {A} from {B,
C, D}; {A, B} from {C, D}; and {A, B, C} from {D}.

LabelBinarizer will allow a single split to distinguish any one category
from all others: {A} from {B, C, D}; {B} from {A, C, D}; {C} from {A, B,
D}; {D} from {A, B, C}.
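And the corresponding sketch for LabelBinarizer (same hypothetical A-D
feature): each category gets its own indicator column, so a threshold on
any one column isolates exactly that category.

```python
from sklearn.preprocessing import LabelBinarizer

values = ["A", "B", "C", "D"]
lb = LabelBinarizer()
onehot = lb.fit_transform(values)  # one indicator column per category
print(onehot)

# A split (column_i > 0.5) separates category i from the other three.
for i, cat in enumerate(lb.classes_):
    isolated = [v for v, row in zip(values, onehot) if row[i] > 0.5]
    rest = [v for v, row in zip(values, onehot) if row[i] <= 0.5]
    print(isolated, "vs", rest)
```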

Note that all these encodings yield the same hypothesis space; it just
might require a deeper tree to represent the same function (and the
learning process can't take advantage of similar categories).

However, in these last two cases, the number of possible splits at a single
node is linear in the number of categories. Selecting an arbitrary
partition allows exponentially many splits, 2^(k-1) - 1 for k categories
(though there may be approximations to avoid evaluating all possible
splits; I'm not familiar with the literature).
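To make the counts concrete, a small sketch (the formulas assume k
distinct categories: k - 1 thresholds for an ordinal encoding, k
single-column splits for one-hot, and 2^(k-1) - 1 two-way partitions for
arbitrary subsets):

```python
def split_counts(k):
    """Number of candidate binary splits for a feature with k categories."""
    ordinal = k - 1            # thresholds between consecutive integer codes
    onehot = k                 # one indicator column per category
    arbitrary = 2 ** (k - 1) - 1  # nonempty two-way partitions of k categories
    return ordinal, onehot, arbitrary

for k in (4, 10, 20):
    o, h, a = split_counts(k)
    print(f"k={k}: ordinal={o}, one-hot={h}, arbitrary={a}")
```

For k = 4 this gives 3, 4, and 7, matching the enumerations above; the gap
widens quickly as k grows.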

So it should be quite clear that binarized categories allow the most
meaningful decisions with the least complexity.


- Joel
Scikit-learn-general mailing list
