GitHub user etrain commented on the pull request:
https://github.com/apache/spark/pull/886#issuecomment-44360446
I am worried that exponential growth in the number of possible splits
kills us when we "check for all splits" once we get to even 20-30
categorical values. k categories give 2^(k-1) - 1 binary splits, so 30
categories is roughly half a billion candidates to check. I have a feeling
that heuristics will be more practical (but I don't have a reference!). We
might add an option for "checking all splits" vs. "using an entropy-based
heuristic" and automatically decide which to use at some conservative,
user-configurable threshold.
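
To put rough numbers on that growth, here is a minimal Python sketch of the split count and of a threshold-based switch between the two modes. The function names and the default limit of 2^15 are hypothetical placeholders for illustration, not MLlib API or anything decided in this PR.

```python
def num_binary_splits(k: int) -> int:
    """Distinct ways to split k unordered categories into two non-empty groups."""
    if k < 2:
        return 0
    return 2 ** (k - 1) - 1


def choose_strategy(k: int, max_exhaustive_splits: int = 1 << 15) -> str:
    """Use exhaustive search while the candidate count stays within a
    user-configurable limit (hypothetical default), else fall back to a heuristic."""
    if num_binary_splits(k) <= max_exhaustive_splits:
        return "exhaustive"
    return "heuristic"


if __name__ == "__main__":
    for k in (5, 10, 20, 30):
        print(k, num_binary_splits(k), choose_strategy(k))
```

With this assumed default, the switch to the heuristic happens at 17 categories; a real threshold would need to be tuned against actual training cost.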
On Tue, May 27, 2014 at 7:41 PM, manishamde <[email protected]> wrote:
> @srowen <https://github.com/srowen> It's good to know about the use case
> for cardinality on the order of tens.
>
> The categorical feature ordering using the average value of the target
> variable works well for both binary classification and regression (section
> 9.2.4 of Elements of Statistical Learning) and it's already implemented in
> MLlib decision tree.
>
> This PR handles the scenario where the 'ordering' assumption does not hold
> for multiclass classification. I like the suggestion of using
> entropy to sort the categories -- it would be great if we could also find a
> theoretical reference for it!
>
> Here is what I propose for handling categorical features in multiclass
> classification:
> 1. We check all splits of the categorical variable if the bin
> constraints are met.
> 2. If the bin constraints are not met, we use a sorting heuristic
> (like entropy of the target variable).
>
> I think this might be the best tradeoff from both the theoretical and
> practical perspectives, and it will save the user a lot of data munging
> effort, which is one of the main advantages of decision trees.
>
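
The two-step proposal quoted above could look roughly like the following Python sketch. It is only an illustration under stated assumptions: `max_bins` stands in for the "bin constraints", every function name is hypothetical, and sorting categories by per-category label entropy is the heuristic floated in this thread, for which no theoretical reference has been found yet. `order_by_mean_target` shows the mean-target ordering mentioned for binary classification and regression.

```python
from collections import Counter, defaultdict
from itertools import combinations
from math import log2


def order_by_mean_target(categories, targets):
    """Sort categories by mean target (binary label or regression value)."""
    sums, counts = defaultdict(float), Counter()
    for c, y in zip(categories, targets):
        sums[c] += y
        counts[c] += 1
    return sorted(counts, key=lambda c: sums[c] / counts[c])


def order_by_entropy(categories, labels):
    """Sort categories by the entropy of the class labels seen in each one."""
    per_cat = defaultdict(Counter)
    for c, y in zip(categories, labels):
        per_cat[c][y] += 1

    def entropy(counter):
        n = sum(counter.values())
        return -sum((v / n) * log2(v / n) for v in counter.values())

    return sorted(per_cat, key=lambda c: entropy(per_cat[c]))


def ordered_splits(ordering):
    """The k - 1 prefix/suffix cuts of an ordered list of categories."""
    return [(ordering[:i], ordering[i:]) for i in range(1, len(ordering))]


def all_splits(cats):
    """All 2^(k-1) - 1 unordered binary splits. The first category is
    pinned to the left side so mirror-image splits are not repeated."""
    first, rest = cats[0], cats[1:]
    out = []
    for r in range(len(rest)):  # r == len(rest) would leave the right side empty
        for combo in combinations(rest, r):
            right = [c for c in rest if c not in combo]
            out.append(([first, *combo], right))
    return out


def candidate_splits(categories, labels, max_bins):
    """Step 1: enumerate every split if the count fits in max_bins.
    Step 2: otherwise fall back to the entropy-sorted ordering."""
    cats = list(dict.fromkeys(categories))  # distinct values, first-seen order
    if len(cats) >= 2 and 2 ** (len(cats) - 1) - 1 <= max_bins:
        return all_splits(cats)
    return ordered_splits(order_by_entropy(categories, labels))
```

The fallback cuts the candidate count from 2^(k-1) - 1 to k - 1, at the cost of possibly missing the best split when the entropy ordering does not line up with the class structure.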