[GitHub] spark pull request: [SPARK-5688][MLLIB] Randomize splits for categ...

edenovit Mon, 09 Feb 2015 12:36:30 -0800

Github user edenovit commented on the pull request:

    https://github.com/apache/spark/pull/4475#issuecomment-73584856
  
    I chose that approach because the library does no preprocessing of its own, 
so I assumed that choosing splits uniformly at random would be the best option. 
However, as you mentioned, it also prevents users from doing any preprocessing 
themselves. 
    Since we agree that the current approach has an issue, I'll switch from the 
completely random implementation to the one based on prefixes. The next step 
would be to presort the values of a categorical feature by the average of their 
respective labels. I just went back to Breiman's book, which is the reference 
cited in the PLANET paper, and this is the method it describes.
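
    To make the presorting idea concrete, here is a minimal sketch in plain 
Python. It is not the MLlib code: `ordered_prefix_splits` and its arguments are 
hypothetical names, and it assumes numeric labels (0/1 classes or regression 
targets), which is the setting where ordering by average label is known to work.

```python
from collections import defaultdict


def ordered_prefix_splits(categories, labels):
    """Sketch: candidate splits for one categorical feature.

    Hypothetical helper, not part of Spark MLlib. Assumes numeric labels
    and category values that can be compared with each other.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, label in zip(categories, labels):
        sums[category] += label
        counts[category] += 1

    # Order the categories by their average label; ties are broken by the
    # category value so that the result is deterministic.
    ordered = sorted(counts, key=lambda c: (sums[c] / counts[c], c))

    # Every proper, non-empty prefix of that ordering is one candidate
    # "left" subset, giving k - 1 candidates for k observed categories.
    return [set(ordered[:i]) for i in range(1, len(ordered))]


splits = ordered_prefix_splits([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 1, 0])
```

    With k observed categories this yields k - 1 candidate splits instead of 
the 2^(k-1) - 1 possible two-way partitions, so the prefixes only need to be 
scanned once in sorted order.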




