On 3 June 2013 08:43, Andreas Mueller <amuel...@ais.uni-bonn.de> wrote:
> On 06/03/2013 05:19 AM, Joel Nothman wrote:
>>
>> However, in these last two cases, the number of possible splits at a
>> single node is linear in the number of categories. Selecting an
>> arbitrary partition allows exponentially many splits with respect to
>> the number of categories (though there may be approximations to avoid
>> evaluating all possible splits; I'm not familiar with the literature).
>>
> I think the standard split is asking whether a variable is equal to a
> value, i.e. selecting subsets of 1. That is possible to do with two
> thresholds, but leads to a weird tree in a way.

Yes, CART builds binary decision trees. (The algorithm which splits a
node into as many children as the number of values of the variable is
ID3.)

As introduced by Breiman in his book, for a categorical variable X
taking its value in {1, ..., L}, the strategy is to consider every
subset S \subseteq {1, ..., L} of values of the variable and to pick
the one leading to the largest reduction of impurity. As such, splits
are defined as yes-no questions of the form "is x in S?".

In scikit-learn, we dont implement that. The main reason is that it
blow up computing times: if L is the cardinality of X, then there are
2^L-1 subsets to consider. The best that you can do with our
implementation is to one-hot encode your categorical variables which
will amount to select subsets of size 1, as Andy said. If you don't
one-hot encode your categorical variables, then you have to be aware
that the construction procedure will implicitely assume that the
categorical values are ordered (which may make no sense depending on
your dataset).

Gilles

------------------------------------------------------------------------------
Get 100% visibility into Java/.NET code with AppDynamics Lite
It's a free troubleshooting tool designed for production
Get down to code-level detail for bottlenecks, with <2% overhead.
Download for free and get started troubleshooting in minutes.
http://p.sf.net/sfu/appdyn_d2d_ap2
_______________________________________________
Scikit-learn-general mailing list
Scikit-learn-general@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to