Sean, sorry for missing out on the discussion. Evan, you are correct, we are using the heuristic Sean suggested during the multiclass PR for ordering high-arity categorical variables using the impurity values for each categorical feature.
Joseph, thanks for fixing the bug which I think was a regression since we added support for RFs. I don't think we have see this in 1.1. -Manish On Mon, Oct 13, 2014 at 11:55 AM, Joseph Bradley <jos...@databricks.com> wrote: > I think this is the fix: > > In this > file: > mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala > > methods "getFeatureOffset" and "getLeftRightFeatureOffsets" have sanity > checks ("require") which are correct for DecisionTree but not for > RandomForest. You can remove those. I've sent a PR with this and a few > other small fixes: > > https://github.com/apache/spark/pull/2785 > > I hope this fixes the bug! > > On Mon, Oct 13, 2014 at 11:19 AM, Sean Owen <so...@cloudera.com> wrote: > > > Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send > > some details directly under separate cover. > > > > On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley <jos...@databricks.com> > > wrote: > > > Hi Sean, > > > > > > Sorry I didn't see this thread earlier! (Thanks Ameet for pinging me.) > > > > > > Short version: That exception should not be thrown, so there is a bug > > > somewhere. The intended logic for handling high-arity categorical > > features > > > is about the best one can do, as far as I know. > > > > > > Bug finding: For my checking purposes, which branch of Spark are you > > using, > > > and do you have the options being submitted to DecisionTree? > > > > > > High-arity categorical features: As you have figured out, if you use a > > > categorical feature with just a few categories, it is treated as > > "unordered" > > > so that we explicitly consider all exponentially many ways to split the > > > categories into 2 groups. If you use one with many categories, then it > > is > > > necessary to impose an order. (The communication increases linearly in > > the > > > number of possible splits, so it would blow up if we considered all > > > exponentially many splits.) This order is chosen separately for each > > node, > > > so it is not a uniform order imposed over the entire tree. This > actually > > > means that it is not a heuristic for regression and binary > > classification; > > > i.e., it chooses the same split as if we had explicitly considered all > of > > > the possible splits. For multiclass classification, it is a heuristic, > > but > > > I don't know of a better solution. > > > > > > I'll check the code, but if you can forward info about the bug, that > > would > > > be very helpful. > > > > > > Thanks! > > > Joseph > > > > > >