Github user thunterdb commented on the pull request:
https://github.com/apache/spark/pull/12663#issuecomment-215091481
A quick look at the scikit-learn source code shows that it always
reindexes the labels, but it does so with an efficient numpy primitive. I
think assuming that the labels are already small integer indices is an
acceptable tradeoff for users (especially in the binary case).
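
To make the tradeoff concrete, here is a minimal sketch of the two approaches in plain numpy. `np.unique(..., return_inverse=True)` is a real numpy call; the helper names `reindex_labels` and `assume_indexed` are hypothetical and are not taken from scikit-learn or from this PR.

```python
import numpy as np


def reindex_labels(y):
    """Always reindex: map arbitrary labels onto 0..k-1 (hypothetical helper)."""
    # np.unique sorts the distinct labels; `indices` gives each sample's
    # position in that sorted array.
    classes, indices = np.unique(np.asarray(y), return_inverse=True)
    return classes, indices


def assume_indexed(y):
    """Assume labels are already small non-negative integer indices
    (hypothetical helper); validate instead of reindexing."""
    y = np.asarray(y)
    if y.size and (y.min() < 0 or not np.all(y == np.floor(y))):
        raise ValueError("labels must be non-negative integers")
    return y.astype(np.int64)
```

The first variant accepts any label values at the cost of a sort; the second skips the sort but rejects labels that are not already indices.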
@jkbradley what happens when a class label is missing from the dataset? I
presume this is not a cause for concern?
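
As an illustration of why the question matters (a numpy sketch only, not a statement about what Spark does): when labels are treated as indices, a missing class simply gets a zero count, whereas reindexing compacts the classes and shifts the indices. The value `num_classes = 3` is an assumption made for the example.

```python
import numpy as np

y = np.array([0, 2, 2, 0])  # class 1 never appears in the data
num_classes = 3             # assumed to be known up front

# Labels treated as indices: the missing class shows up as a zero count.
counts = np.bincount(y, minlength=num_classes)

# Reindexing: only two classes are found, and label 2 becomes index 1.
classes, reindexed = np.unique(y, return_inverse=True)
```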