I think this is the fix: In this file: mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala
methods "getFeatureOffset" and "getLeftRightFeatureOffsets" have sanity checks ("require") which are correct for DecisionTree but not for RandomForest. You can remove those. I've sent a PR with this and a few other small fixes: https://github.com/apache/spark/pull/2785 I hope this fixes the bug! On Mon, Oct 13, 2014 at 11:19 AM, Sean Owen <so...@cloudera.com> wrote: > Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send > some details directly under separate cover. > > On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley <jos...@databricks.com> > wrote: > > Hi Sean, > > > > Sorry I didn't see this thread earlier! (Thanks Ameet for pinging me.) > > > > Short version: That exception should not be thrown, so there is a bug > > somewhere. The intended logic for handling high-arity categorical > features > > is about the best one can do, as far as I know. > > > > Bug finding: For my checking purposes, which branch of Spark are you > using, > > and do you have the options being submitted to DecisionTree? > > > > High-arity categorical features: As you have figured out, if you use a > > categorical feature with just a few categories, it is treated as > "unordered" > > so that we explicitly consider all exponentially many ways to split the > > categories into 2 groups. If you use one with many categories, then it > is > > necessary to impose an order. (The communication increases linearly in > the > > number of possible splits, so it would blow up if we considered all > > exponentially many splits.) This order is chosen separately for each > node, > > so it is not a uniform order imposed over the entire tree. This actually > > means that it is not a heuristic for regression and binary > classification; > > i.e., it chooses the same split as if we had explicitly considered all of > > the possible splits. For multiclass classification, it is a heuristic, > but > > I don't know of a better solution. > > > > I'll check the code, but if you can forward info about the bug, that > would > > be very helpful. > > > > Thanks! > > Joseph > > >