Github user manishamde commented on the pull request:
https://github.com/apache/spark/pull/886#issuecomment-44359841
@srowen It's good to know about the use case for cardinality on the order
of tens.
The categorical feature ordering using the average value of the target
variable works well for both binary classification and regression (section
9.2.4 of Elements of Statistical Learning), and it's already implemented in
the MLlib decision tree.
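As a rough illustration of that ordering trick (a minimal sketch, not the MLlib code; the function name `ordered_category_splits` and the toy data are hypothetical), the categories can be sorted by the mean of the target, after which only the k-1 contiguous splits of that ordering need to be evaluated:

```python
from collections import defaultdict


def ordered_category_splits(categories, targets):
    """Sort categories by mean target value and return the k-1 contiguous splits.

    Hypothetical helper for illustration; targets are numeric
    (0/1 labels for binary classification, real values for regression).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    # Ties on the mean are broken by the category value itself.
    ordered = sorted(counts, key=lambda cat: (sums[cat] / counts[cat], cat))
    # Split i sends the first i categories left and the remaining ones right.
    return [(ordered[:i], ordered[i:]) for i in range(1, len(ordered))]


splits = ordered_category_splits(
    ["a", "b", "c", "a", "b", "c"],
    [1, 0, 1, 1, 0, 0],
)
```

With k categories this evaluates k-1 candidate splits instead of every subset of categories.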
This PR handles the scenario where the 'ordering' assumption does not hold
for multiclass classification. I like the suggestion of using entropy
to sort the categories; it would be great if we could also find a theoretical
reference for it!
Here is what I propose for handling categorical features in multiclass
classification:
1. We check whether all possible splits of the categorical variable fit within
the bin constraints; if they do, we consider every split.
2. If the bin constraints are not met, we fall back to a sorting heuristic (such as
the entropy of the target variable) to order the categories.
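The two steps above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: it assumes the "bin constraint" means that the number of distinct two-way splits, 2^(k-1) - 1 for k categories, must not exceed a `max_bins` limit, and it assumes the heuristic sorts categories by the entropy of the labels observed within each category. The names `candidate_splits` and `label_entropy` are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def candidate_splits(categories, labels, max_bins):
    """Return candidate (left, right) category splits for one categorical feature.

    Assumption: the bin constraint is met when 2^(k-1) - 1 <= max_bins.
    """
    by_cat = defaultdict(list)
    for cat, y in zip(categories, labels):
        by_cat[cat].append(y)
    cats = sorted(by_cat)
    k = len(cats)

    if 2 ** (k - 1) - 1 <= max_bins:
        # Step 1: enumerate every split. Pinning the first category to the
        # left side avoids counting mirrored splits twice.
        first, rest = cats[0], cats[1:]
        splits = []
        for r in range(len(rest)):
            for combo in combinations(rest, r):
                left = [first, *combo]
                right = [c for c in cats if c not in left]
                splits.append((left, right))
        return splits

    # Step 2: too many subsets, so order categories by per-category label
    # entropy (assumed heuristic) and keep only the k-1 contiguous splits.
    ordered = sorted(cats, key=lambda c: (label_entropy(by_cat[c]), c))
    return [(ordered[:i], ordered[i:]) for i in range(1, k)]


cats = ["a", "a", "b", "b", "c", "c", "c", "c"]
ys = [0, 0, 0, 1, 0, 1, 2, 2]
exhaustive = candidate_splits(cats, ys, max_bins=3)
heuristic = candidate_splits(cats, ys, max_bins=2)
```

In this sketch the exhaustive branch grows exponentially with the number of categories, which is why the heuristic branch exists for higher-cardinality features.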
I think this might be the best tradeoff from both a theoretical and a
practical perspective, and it will save the user a lot of data munging effort.
Not needing that kind of preprocessing is one of the main advantages of decision trees.