GitHub user edenovit commented on the pull request:
https://github.com/apache/spark/pull/4475#issuecomment-73584856
I chose that approach because the library did no preprocessing of its own,
so I assumed that choosing uniformly at random would be the best option.
However, as you mentioned, this approach also prevents the user from doing
any preprocessing themselves.
Since we agree that the current approach has an issue, I'll switch from the
completely random implementation to the one with prefixes. The next step would
be to presort the values of each categorical feature by the average of their
respective labels. I just went back to Breiman's book, which is the reference
cited in the Planet paper, and this is the method it describes.
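
To make the presorting step concrete, here is a minimal sketch in plain
Python. It is not the code in this pull request. The function name
`ordered_prefix_splits` and its arguments are hypothetical, and it assumes
numeric labels (regression targets or 0/1 class labels). It orders the
categories by mean label and then treats every proper prefix of that ordering
as one candidate left-branch subset.

```python
from collections import defaultdict


def ordered_prefix_splits(categories, labels):
    """Return candidate left-branch category sets for one categorical feature.

    Hypothetical helper for illustration only; assumes numeric labels.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    # Accumulate the label sum and the row count per category.
    for category, label in zip(categories, labels):
        sums[category] += label
        counts[category] += 1

    # Order categories by mean label; ties are broken by the category value
    # so that the result is deterministic.
    ordered = sorted(counts, key=lambda c: (sums[c] / counts[c], c))

    # Each proper prefix of the ordering is one candidate split, so a
    # feature with k observed categories yields k - 1 candidates.
    return [set(ordered[:i]) for i in range(1, len(ordered))]


if __name__ == "__main__":
    cats = ["a", "b", "c", "a", "b", "c"]
    ys = [1, 0, 0, 1, 1, 0]
    for left in ordered_prefix_splits(cats, ys):
        print(sorted(left))
```

With k categories this evaluates k - 1 candidate subsets instead of the
2^(k-1) - 1 possible binary partitions, which is the appeal of the prefix
approach once the ordering is meaningful.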