Re: Decision forests don't work with non-trivial categorical features

Joseph Bradley Mon, 13 Oct 2014 11:55:49 -0700

I think this is the fix:

In this
file: 
mllib/src/main/scala/org/apache/spark/mllib/tree/impl/DTStatsAggregator.scala


methods "getFeatureOffset" and "getLeftRightFeatureOffsets" have sanity
checks ("require") which are correct for DecisionTree but not for
RandomForest.  You can remove those.  I've sent a PR with this and a few
other small fixes:

https://github.com/apache/spark/pull/2785

I hope this fixes the bug!

On Mon, Oct 13, 2014 at 11:19 AM, Sean Owen <[email protected]> wrote:

> Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send
> some details directly under separate cover.
>
> On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley <[email protected]>
> wrote:
> > Hi Sean,
> >
> > Sorry I didn't see this thread earlier!  (Thanks Ameet for pinging me.)
> >
> > Short version: That exception should not be thrown, so there is a bug
> > somewhere.  The intended logic for handling high-arity categorical
> features
> > is about the best one can do, as far as I know.
> >
> > Bug finding: For my checking purposes, which branch of Spark are you
> using,
> > and do you have the options being submitted to DecisionTree?
> >
> > High-arity categorical features: As you have figured out, if you use a
> > categorical feature with just a few categories, it is treated as
> "unordered"
> > so that we explicitly consider all exponentially many ways to split the
> > categories into 2 groups.  If you use one with many categories, then it
> is
> > necessary to impose an order.  (The communication increases linearly in
> the
> > number of possible splits, so it would blow up if we considered all
> > exponentially many splits.)  This order is chosen separately for each
> node,
> > so it is not a uniform order imposed over the entire tree.  This actually
> > means that it is not a heuristic for regression and binary
> classification;
> > i.e., it chooses the same split as if we had explicitly considered all of
> > the possible splits.  For multiclass classification, it is a heuristic,
> but
> > I don't know of a better solution.
> >
> > I'll check the code, but if you can forward info about the bug, that
> would
> > be very helpful.
> >
> > Thanks!
> > Joseph
> >
>

Re: Decision forests don't work with non-trivial categorical features

Reply via email to