Sean, sorry for missing out on the discussion.

Evan, you are correct, we are using the heuristic Sean suggested during the
multiclass PR for ordering high-arity categorical variables using the
impurity values for each categorical feature.

Joseph, thanks for fixing the bug which I think was a regression since we
added support for RFs. I don't think we have see this in 1.1.

-Manish

On Mon, Oct 13, 2014 at 11:55 AM, Joseph Bradley <jos...@databricks.com>
wrote:

> I think this is the fix:
>
> In this
> file:
> mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala
>
> methods "getFeatureOffset" and "getLeftRightFeatureOffsets" have sanity
> checks ("require") which are correct for DecisionTree but not for
> RandomForest.  You can remove those.  I've sent a PR with this and a few
> other small fixes:
>
> https://github.com/apache/spark/pull/2785
>
> I hope this fixes the bug!
>
> On Mon, Oct 13, 2014 at 11:19 AM, Sean Owen <so...@cloudera.com> wrote:
>
> > Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send
> > some details directly under separate cover.
> >
> > On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley <jos...@databricks.com>
> > wrote:
> > > Hi Sean,
> > >
> > > Sorry I didn't see this thread earlier!  (Thanks Ameet for pinging me.)
> > >
> > > Short version: That exception should not be thrown, so there is a bug
> > > somewhere.  The intended logic for handling high-arity categorical
> > features
> > > is about the best one can do, as far as I know.
> > >
> > > Bug finding: For my checking purposes, which branch of Spark are you
> > using,
> > > and do you have the options being submitted to DecisionTree?
> > >
> > > High-arity categorical features: As you have figured out, if you use a
> > > categorical feature with just a few categories, it is treated as
> > "unordered"
> > > so that we explicitly consider all exponentially many ways to split the
> > > categories into 2 groups.  If you use one with many categories, then it
> > is
> > > necessary to impose an order.  (The communication increases linearly in
> > the
> > > number of possible splits, so it would blow up if we considered all
> > > exponentially many splits.)  This order is chosen separately for each
> > node,
> > > so it is not a uniform order imposed over the entire tree.  This
> actually
> > > means that it is not a heuristic for regression and binary
> > classification;
> > > i.e., it chooses the same split as if we had explicitly considered all
> of
> > > the possible splits.  For multiclass classification, it is a heuristic,
> > but
> > > I don't know of a better solution.
> > >
> > > I'll check the code, but if you can forward info about the bug, that
> > would
> > > be very helpful.
> > >
> > > Thanks!
> > > Joseph
> > >
> >
>

Reply via email to