On 3 June 2013 08:43, Andreas Mueller <amuel...@ais.uni-bonn.de> wrote: > On 06/03/2013 05:19 AM, Joel Nothman wrote: >> >> However, in these last two cases, the number of possible splits at a >> single node is linear in the number of categories. Selecting an >> arbitrary partition allows exponentially many splits with respect to >> the number of categories (though there may be approximations to avoid >> evaluating all possible splits; I'm not familiar with the literature). >> > I think the standard split is asking whether a variable is equal to a > value, i.e. selecting subsets of 1. That is possible to do with two > thresholds, but leads to a weird tree in a way.
Yes, CART builds binary decision trees. (The algorithm which splits a node into as many children as the number of values of the variable is ID3.) As introduced by Breiman in his book, for a categorical variable X taking its value in {1, ..., L}, the strategy is to consider every subset S \subseteq {1, ..., L} of values of the variable and to pick the one leading to the largest reduction of impurity. As such, splits are defined as yes-no questions of the form "is x in S?". In scikit-learn, we dont implement that. The main reason is that it blow up computing times: if L is the cardinality of X, then there are 2^L-1 subsets to consider. The best that you can do with our implementation is to one-hot encode your categorical variables which will amount to select subsets of size 1, as Andy said. If you don't one-hot encode your categorical variables, then you have to be aware that the construction procedure will implicitely assume that the categorical values are ordered (which may make no sense depending on your dataset). Gilles ------------------------------------------------------------------------------ Get 100% visibility into Java/.NET code with AppDynamics Lite It's a free troubleshooting tool designed for production Get down to code-level detail for bottlenecks, with <2% overhead. Download for free and get started troubleshooting in minutes. http://p.sf.net/sfu/appdyn_d2d_ap2 _______________________________________________ Scikit-learn-general mailing list Scikit-learn-general@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/scikit-learn-general