Github user manishamde commented on the pull request:

    https://github.com/apache/spark/pull/886#issuecomment-44359841
  
    @srowen It's good to know about the use case for cardinality on the order 
of tens.
    
    Ordering the categories of a feature by the average value of the target 
variable works well for both binary classification and regression (section 
9.2.4 of Elements of Statistical Learning), and it is already implemented in 
the MLlib decision tree. 
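    
    As a rough illustration of that ordering trick (a minimal Python sketch, 
not the MLlib code; the function names here are made up for the example): 
sort the categories by mean target, then only consider the k - 1 splits that 
cut the sorted list in two.

```python
from collections import defaultdict


def order_categories_by_mean_target(categories, targets):
    """Return the distinct categories sorted by the mean target value.

    `targets` holds 0/1 labels (binary classification) or real values
    (regression). Hypothetical helper, for illustration only.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    return sorted(sums, key=lambda cat: sums[cat] / counts[cat])


def ordered_splits(ordered):
    """Enumerate the k - 1 contiguous (left, right) splits of an ordered list."""
    return [(set(ordered[:i]), set(ordered[i:])) for i in range(1, len(ordered))]


cats = ["a", "b", "c", "a", "b", "c"]
ys = [1, 0, 1, 1, 0, 0]
order = order_categories_by_mean_target(cats, ys)  # means: a=1.0, b=0.0, c=0.5
print(order)
print(ordered_splits(order))
```

    With k categories this evaluates k - 1 candidate splits instead of all 
2^(k-1) - 1 subsets.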
    
    This PR handles the scenario where the 'ordering' assumption does not hold 
for multiclass classification. I like the suggestion of using entropy 
to sort the categories -- it would be great if we could also find a theoretical 
reference for it!
    
    Here is what I propose for handling categorical features in multiclass 
classification:
    1. Check whether considering all splits of the categorical variable stays 
within the bin constraints; if it does, evaluate all of them.
    2. If the bin constraints are not met, fall back to a sorting heuristic 
(such as the entropy of the target variable).
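    
    The two steps above could look roughly like the following Python sketch. 
It is only an illustration under assumptions: `max_bins` is a stand-in for the 
bin constraint, "constraints met" is read as "the number of unordered splits, 
2^(k-1) - 1, is at most max_bins", and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def num_unordered_splits(k):
    """Number of distinct two-way partitions of k categories."""
    return 2 ** (k - 1) - 1


def label_entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def choose_split_candidates(categories, labels, max_bins):
    """Return (strategy, splits) for one categorical feature.

    Assumption: the bin constraint is met when every unordered split
    fits within `max_bins`.
    """
    by_cat = defaultdict(list)
    for cat, y in zip(categories, labels):
        by_cat[cat].append(y)
    cats = sorted(by_cat)
    k = len(cats)

    if num_unordered_splits(k) <= max_bins:
        # Step 1: enumerate every split. Pinning the first category to the
        # left side avoids generating mirror-image duplicates.
        first, rest = cats[0], cats[1:]
        splits = []
        for r in range(len(rest)):
            for extra in combinations(rest, r):
                left = {first, *extra}
                splits.append((left, set(cats) - left))
        return "exhaustive", splits

    # Step 2: heuristic fallback. Sort categories by the entropy of their
    # label distribution (ties broken by name) and keep the k - 1 cuts.
    ordered = sorted(cats, key=lambda c: (label_entropy(by_cat[c]), c))
    return "ordered", [(set(ordered[:i]), set(ordered[i:])) for i in range(1, k)]


cats = ["a", "a", "b", "b", "c", "c", "c"]
ys = [0, 0, 0, 1, 0, 1, 2]
print(choose_split_candidates(cats, ys, max_bins=3))
print(choose_split_candidates(cats, ys, max_bins=2))
```

    The entropy ordering here is just the heuristic suggested in this thread; 
unlike the mean-target ordering, I am not aware of a guarantee that it finds 
the best split.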
    
    I think this might be the best tradeoff from both the theoretical and 
practical perspectives, and it will save the user a lot of data munging effort, 
which is one of the main advantages of decision trees.

