Great, we'll confer then. I'm using master / 1.2.0-SNAPSHOT. I'll send
some details directly under separate cover.

On Mon, Oct 13, 2014 at 7:12 PM, Joseph Bradley <jos...@databricks.com> wrote:
> Hi Sean,
>
> Sorry I didn't see this thread earlier!  (Thanks Ameet for pinging me.)
>
> Short version: That exception should not be thrown, so there is a bug
> somewhere.  The intended logic for handling high-arity categorical features
> is about the best one can do, as far as I know.
>
> Bug finding: So that I can check, which branch of Spark are you using, and
> can you share the options you are passing to DecisionTree?
>
> High-arity categorical features: As you have figured out, if you use a
> categorical feature with just a few categories, it is treated as "unordered"
> so that we explicitly consider all exponentially many ways to split the
> categories into 2 groups.  If you use one with many categories, then it is
> necessary to impose an order.  (The communication cost grows linearly in the
> number of candidate splits, so it would blow up if we considered all
> exponentially many splits.)  This order is chosen separately for each node,
> so it is not a uniform order imposed over the entire tree.  This actually
> means that it is not a heuristic for regression and binary classification;
> i.e., it chooses the same split as if we had explicitly considered all of
> the possible splits.  For multiclass classification, it is a heuristic, but
> I don't know of a better solution.
>
> I'll check the code, but if you can forward info about the bug, that would
> be very helpful.
>
> Thanks!
> Joseph
>
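
For anyone following the thread, here is a small sketch of the regression
claim above: ordering the categories by mean label and trying only the k-1
"prefix" splits finds the same best cost as trying all 2^(k-1)-1 partitions.
This is a toy illustration in plain Python, not the MLlib code; the data and
every function name below are made up for the example, and squared error is
assumed as the impurity.

```python
from itertools import combinations


def sse(labels):
    """Sum of squared deviations of the labels from their mean."""
    if not labels:
        return 0.0
    mean = sum(labels) / len(labels)
    return sum((y - mean) ** 2 for y in labels)


def split_cost(data, left):
    """Total squared error when the categories in `left` go to one child."""
    left_labels = [y for c, ys in data.items() if c in left for y in ys]
    right_labels = [y for c, ys in data.items() if c not in left for y in ys]
    return sse(left_labels) + sse(right_labels)


def best_exhaustive_split(data):
    """Try every 2-way partition of the categories ("unordered" handling)."""
    cats = sorted(data)
    first, rest = cats[0], cats[1:]
    candidates = []
    # Pin one category to the left side so mirror-image partitions are not
    # counted twice; stop before the left side would hold every category.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            left = {first, *combo}
            candidates.append((split_cost(data, left), sorted(left)))
    cost, left = min(candidates)
    return cost, left, len(candidates)


def best_ordered_split(data):
    """Order categories by mean label, then try only the k-1 prefix splits."""
    order = sorted(data, key=lambda c: sum(data[c]) / len(data[c]))
    candidates = []
    for i in range(1, len(order)):
        left = set(order[:i])
        candidates.append((split_cost(data, left), sorted(left)))
    cost, left = min(candidates)
    return cost, left, len(candidates)


# Hypothetical labels observed at one tree node, grouped by category.
data = {
    "a": [1.0, 2.0],
    "b": [10.0, 11.0],
    "c": [2.5],
    "d": [9.0, 13.0],
    "e": [5.0],
}

cost_all, left_all, n_all = best_exhaustive_split(data)
cost_ord, left_ord, n_ord = best_ordered_split(data)
print(n_all, "exhaustive candidates, best cost", cost_all, "left side", left_all)
print(n_ord, "ordered candidates, best cost", cost_ord, "left side", left_ord)
```

Because the order is recomputed from the labels at each node, the same trick
applies node by node. For multiclass labels there is no single mean to sort
by, which is where it becomes a heuristic.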

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
