Github user srowen commented on the pull request:
https://github.com/apache/spark/pull/886#issuecomment-44243946
@manishamde Yes, for categorical features with high cardinality, you don't
want to consider all possible splits. I don't think a cardinality of 30 or 40
is that unusual, though. Honestly, I've always resented the fact that R simply
can't handle more than 32!
There are heuristics, however, that work well while considering only a number
of splits linear in the number of values. For regression, sort the categorical
values by the mean of the target variable, then consider just the prefixes of
that ordering as the subsets to try. Google's PLANET paper claims that is
optimal.
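Not the Oryx code, just a rough sketch in Scala of the idea (names and the
plain-collections setup are made up for illustration): order the categories by
mean target, then the only candidate left-hand subsets are prefixes of that
ordering, so you evaluate |categories| - 1 splits instead of exponentially many.

```scala
object CategoricalRegressionSplits {

  /** data: (category value, numeric target) pairs for one categorical feature. */
  def candidateSubsets(data: Seq[(String, Double)]): Seq[Set[String]] = {
    // Mean target per category, sorted ascending by that mean.
    val orderedByMean: Seq[String] =
      data.groupBy(_._1)
          .map { case (cat, rows) => (cat, rows.map(_._2).sum / rows.size) }
          .toSeq
          .sortBy(_._2)
          .map(_._1)

    // Proper, non-empty prefixes of the sorted category list are the
    // candidate subsets for the "left" branch of a binary split.
    (1 until orderedByMean.size).map(k => orderedByMean.take(k).toSet)
  }

  def main(args: Array[String]): Unit = {
    val data = Seq(("a", 1.0), ("a", 2.0), ("b", 10.0), ("c", 5.0), ("c", 6.0))
    // Means: a = 1.5, c = 5.5, b = 10.0, so candidates are Set(a), Set(a, c)
    candidateSubsets(data).foreach(println)
  }
}
```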
For classification, where the target itself is categorical, I don't know of
a provably optimal way to do it. The heuristic I have used is to sort the
categorical values by the entropy of the target distribution within each
value. This seems pretty OK.
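Same idea in sketch form (again just illustrative Scala, not the Oryx code):
compute the entropy of the class counts within each category, sort categories
by that entropy, and again try only prefixes of the ordering.

```scala
object CategoricalClassificationSplits {

  /** Shannon entropy (base 2) of a set of class counts. */
  private def entropy(counts: Iterable[Int]): Double = {
    val total = counts.sum.toDouble
    counts.filter(_ > 0).map { c =>
      val p = c / total
      -p * math.log(p) / math.log(2)
    }.sum
  }

  /** data: (category value, target class) pairs for one categorical feature. */
  def candidateSubsets(data: Seq[(String, String)]): Seq[Set[String]] = {
    // Entropy of the target-class distribution within each category,
    // then categories ordered by that entropy.
    val orderedByEntropy: Seq[String] =
      data.groupBy(_._1)
          .map { case (cat, rows) =>
            val classCounts = rows.groupBy(_._2).values.map(_.size)
            (cat, entropy(classCounts))
          }
          .toSeq
          .sortBy(_._2)
          .map(_._1)

    // As in the regression case, only prefixes of the ordering are evaluated.
    (1 until orderedByEntropy.size).map(k => orderedByEntropy.take(k).toSet)
  }
}
```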
There is some Java code for creating the decision rules to evaluate here,
in `CategoricalDecision.java` and `NumericDecision.java`:
https://github.com/cloudera/oryx/tree/master/rdf-common/src/main/java/com/cloudera/oryx/rdf/common/rule
It's pretty easy to lift them and Scala-fy them. I'd really like to see
functionality like this so MLlib's RDF can be comparable and I can move to it.