Github user actuaryzhang commented on the issue:

    https://github.com/apache/spark/pull/17879
  
    @holdenk The main motivation for this PR is that the behavior of 
StringIndexer affects OneHotEncoder, RFormula, and models estimated based on 
these transformers. Several desired improvements in RFormula cannot be made 
without this change in StringIndexer.
    
    One use case for alphabetical ordering is comparing Spark model results 
with those from R, which drops the first alphabetical value in one-hot 
encoding. Right now, even though we do lots of comparisons between Spark and R, 
we lack comparisons involving String features because the encoding is 
different. There is already a 
[JIRA](https://issues.apache.org/jira/browse/SPARK-14659). 
    
    Another motivation for this PR is to support ascending order by label 
frequency. This is also related to one-hot encoding. In practical applications 
of regression-type models, it is almost always better to set the most frequent 
label as the reference level (i.e., drop the most frequent label in 
OneHotEncoding) for better interpretability. Right now, the behavior is the 
opposite, which makes the results very difficult to interpret. 
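    
    To make the interaction concrete, here is a minimal plain-Python sketch 
(not Spark code) of how the index ordering decides which label ends up as the 
reference level. It assumes the encoder drops the category with the highest 
index; the function names and the ordering names ("frequencyDesc" and so on) 
are hypothetical labels chosen for illustration, not a confirmed API.

```python
from collections import Counter


def index_labels(values, order_type):
    """Assign an index to each distinct label under the given ordering.

    The order_type names are illustrative assumptions, not a confirmed API.
    Ties in frequency are broken alphabetically to keep the result stable.
    """
    counts = Counter(values)
    if order_type == "frequencyDesc":
        ordered = sorted(counts, key=lambda v: (-counts[v], v))
    elif order_type == "frequencyAsc":
        ordered = sorted(counts, key=lambda v: (counts[v], v))
    elif order_type == "alphabetAsc":
        ordered = sorted(counts)
    elif order_type == "alphabetDesc":
        ordered = sorted(counts, reverse=True)
    else:
        raise ValueError(f"unknown order type: {order_type}")
    return {label: i for i, label in enumerate(ordered)}


def dropped_level(values, order_type):
    """Return the label a drop-last one-hot encoding would omit.

    Assumes the reference level is the label with the highest index.
    """
    index = index_labels(values, order_type)
    return max(index, key=index.get)


if __name__ == "__main__":
    labels = ["b", "a", "b", "c", "b", "a"]  # b: 3, a: 2, c: 1
    for order in ("frequencyDesc", "frequencyAsc", "alphabetDesc"):
        print(order, "drops", dropped_level(labels, order))
```

    Under these assumptions, descending frequency drops the rarest label, 
ascending frequency drops the most frequent one, and descending alphabetical 
order drops the first alphabetical value, which is the level R uses as its 
reference.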
    
    I think the flexibility of different orderings will greatly benefit 
downstream feature transformers and model estimators. Does this make sense? 


