Github user actuaryzhang commented on the issue:

https://github.com/apache/spark/pull/17967

@yanboliang Thanks for the question. The alphabetically ascending order in R is very convenient for display purposes. For example, when you print a summary of model results, the results are easier to read if the levels appear in alphabetically ascending order. That's the default, but users will often reset the reference level so that the most frequent level becomes the base (the one dropped in one-hot encoding). This also helps interpretation, because in very unbalanced data the most frequent level can be roughly regarded as the population average. Otherwise, especially in unbalanced data, the contrast between categories that each have few observations is usually insignificant. Of course, this does not change the fitted model, but it is very important for interpretation.

I understand that ordering string levels by descending frequency is helpful for other applications, such as tree-based split decisions. But the ML library will be much better if we can also support these other options, which are often used in day-to-day work. This will broaden the use cases of Spark ML.
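As a rough illustration of the two orderings discussed above, here is a minimal pure-Python sketch. It is not Spark or R code, and `order_levels` and `dummy_encode` are hypothetical helper names invented for this example. It assumes the R convention that the first level in the ordering is the reference level, that is, the one dropped when building dummy columns.

```python
from collections import Counter


def order_levels(values, order="alphabet_asc"):
    """Return the distinct levels of `values` in the requested order.

    Hypothetical helper: the option names are illustrative only.
    """
    counts = Counter(values)
    if order == "alphabet_asc":
        return sorted(counts)
    if order == "frequency_desc":
        # Most frequent first; ties are broken alphabetically so the
        # result is deterministic.
        return sorted(counts, key=lambda level: (-counts[level], level))
    raise ValueError(f"unknown order: {order}")


def dummy_encode(values, levels):
    """Encode each value as 0/1 indicators, dropping levels[0] as the base."""
    non_base = levels[1:]
    return [[1 if v == level else 0 for level in non_base] for v in values]


data = ["b", "c", "b", "a", "b", "c"]

alpha_levels = order_levels(data, "alphabet_asc")
freq_levels = order_levels(data, "frequency_desc")

# Under alphabetical order the base is "a", which has a single observation.
# Under frequency order the base is "b", the most frequent level.
print(alpha_levels[0], freq_levels[0])
print(dummy_encode(data, freq_levels))
```

Both encodings span the same set of columns up to a reparameterization, which is why the fit is unchanged and only the meaning of each coefficient (a contrast against the chosen base) differs.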