Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send some details directly under separate cover.
On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley <jos...@databricks.com> wrote:
> Hi Sean,
>
> Sorry I didn't see this thread earlier! (Thanks Ameet for pinging me.)
>
> Short version: That exception should not be thrown, so there is a bug
> somewhere. The intended logic for handling high-arity categorical features
> is about the best one can do, as far as I know.
>
> Bug finding: For my checking purposes, which branch of Spark are you using,
> and do you have the options being submitted to DecisionTree?
>
> High-arity categorical features: As you have figured out, if you use a
> categorical feature with just a few categories, it is treated as "unordered"
> so that we explicitly consider all exponentially many ways to split the
> categories into 2 groups. If you use one with many categories, then it is
> necessary to impose an order. (The communication increases linearly in the
> number of possible splits, so it would blow up if we considered all
> exponentially many splits.) This order is chosen separately for each node,
> so it is not a uniform order imposed over the entire tree. This actually
> means that it is not a heuristic for regression and binary classification;
> i.e., it chooses the same split as if we had explicitly considered all of
> the possible splits. For multiclass classification, it is a heuristic, but
> I don't know of a better solution.
>
> I'll check the code, but if you can forward info about the bug, that would
> be very helpful.
>
> Thanks!
> Joseph
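
For anyone following the thread, here is a minimal sketch of the ordered vs. unordered point above. It is plain Python, not Spark's implementation. All names and the toy data are hypothetical, and it assumes squared-error impurity for regression with categories ordered by their mean label at the node. Under those assumptions, scanning only the k-1 splits along that order should find the same cost as enumerating every way to split the categories into 2 groups.

```python
from itertools import combinations


def sse(values):
    """Sum of squared errors around the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_cost(labels_by_cat, left_cats):
    """Total squared error after sending left_cats left and the rest right."""
    left = [y for c, ys in labels_by_cat.items() if c in left_cats for y in ys]
    right = [y for c, ys in labels_by_cat.items() if c not in left_cats for y in ys]
    return sse(left) + sse(right)


def best_unordered_split(labels_by_cat):
    """Try every non-empty proper subset of categories as the left side."""
    cats = sorted(labels_by_cat)
    candidates = [
        frozenset(subset)
        for size in range(1, len(cats))
        for subset in combinations(cats, size)
    ]
    best = min(candidates, key=lambda s: split_cost(labels_by_cat, s))
    return split_cost(labels_by_cat, best), len(candidates)


def best_ordered_split(labels_by_cat):
    """Order categories by mean label, then try only prefixes of that order."""
    ordered = sorted(
        labels_by_cat,
        key=lambda c: sum(labels_by_cat[c]) / len(labels_by_cat[c]),
    )
    candidates = [frozenset(ordered[:i]) for i in range(1, len(ordered))]
    best = min(candidates, key=lambda s: split_cost(labels_by_cat, s))
    return split_cost(labels_by_cat, best), len(candidates)


# Toy data for one tree node: category -> regression labels seen at this node.
node_data = {
    "a": [1.0, 2.0],
    "b": [10.0, 11.0],
    "c": [1.5, 2.5],
    "d": [9.0, 11.0],
    "e": [5.0, 6.0],
}

unordered_cost, unordered_count = best_unordered_split(node_data)
ordered_cost, ordered_count = best_ordered_split(node_data)
print("unordered:", unordered_count, "candidate splits, best cost", unordered_cost)
print("ordered:  ", ordered_count, "candidate splits, best cost", ordered_cost)
```

Because the order is recomputed from the labels at each node, a different node with different data could order the same categories differently, which matches the "chosen separately for each node" remark. The sketch does not cover the multiclass case, where the ordered scan is only a heuristic.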