[GitHub] spark issue #17967: [SPARK-14659][ML] RFormula consistent with R when handli...

actuaryzhang Fri, 19 May 2017 14:54:24 -0700

Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17967
  
    @yanboliang Thanks for the question. 
    
    The alphabetically ascending order in R is very convenient for display 
purposes. For example, when you summarize model results, they are easier to 
read if the levels appear in alphabetically ascending order. 
    
    That's the default, but users often reset the reference level to make the 
most frequent level the base (the one dropped in one-hot encoding). This also 
helps interpretation, because the most frequent level can be roughly regarded 
as the population average (in very unbalanced data). Otherwise, especially in 
unbalanced data, a contrast between two categories with few observations is 
usually insignificant. Of course, this does not change the model, but it is 
very important for interpretation. 
    
    I understand that ordering string levels by descending frequency is helpful 
for other applications, such as tree-based split decisions. But the ML library 
would be much better if it also supported these other options, which are often 
used in day-to-day work. This would broaden the use cases of Spark ML.
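
    To make the two orderings and the role of the base level concrete, here is 
a minimal sketch in plain Python. It is only an illustration, not Spark or R 
code: the helpers `order_levels` and `one_hot`, the order names, and the toy 
data are all assumptions made up for this example.

```python
from collections import Counter


def order_levels(values, order="alphabet_asc"):
    """Return the distinct string levels in the requested order.

    Hypothetical helper for illustration; the order names are assumptions.
    """
    counts = Counter(values)
    if order == "alphabet_asc":
        return sorted(counts)
    if order == "frequency_desc":
        # Break ties alphabetically so the result is deterministic.
        return sorted(counts, key=lambda lvl: (-counts[lvl], lvl))
    raise ValueError(f"unknown order: {order}")


def one_hot(values, levels, reference):
    """Dummy-code `values`, dropping the column of the reference level."""
    kept = [lvl for lvl in levels if lvl != reference]
    rows = [[1 if v == lvl else 0 for lvl in kept] for v in values]
    return kept, rows


# Toy, unbalanced data: "b" is the most frequent level.
data = ["b", "c", "b", "a", "b", "c"]

alpha = order_levels(data, "alphabet_asc")
freq = order_levels(data, "frequency_desc")

# Default: the alphabetically first level is the base.
cols_default, rows_default = one_hot(data, alpha, reference=alpha[0])

# Reset the base to the most frequent level; display order stays alphabetical.
cols_freq, rows_freq = one_hot(data, alpha, reference=freq[0])

print(alpha, freq)
print(cols_default, cols_freq)
```

    With the base reset to the most frequent level, every remaining column is 
a contrast against "b", while the columns themselves can still be listed 
alphabetically for display.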




