Github user jkbradley commented on the pull request:
https://github.com/apache/spark/pull/3702#issuecomment-68286086
@srowen Sorry for the delay! I'm really starting to wonder about this
JIRA, though. The collect() should return one BinaryLabelCounter per
partition, and I'd assume people have enough memory to store at least a few
million BinaryLabelCounter instances on the driver. Does that mean anyone
hitting a problem here has more than a few million partitions?
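
A minimal sketch of the per-partition argument, in plain Python rather than Spark. It assumes each partition is reduced to a single small counter before anything reaches the driver; `LabelCounter`, `summarize_partition`, and `collect_per_partition` are hypothetical stand-ins for illustration, not Spark's classes or API.

```python
from dataclasses import dataclass


@dataclass
class LabelCounter:
    """Hypothetical stand-in for a BinaryLabelCounter: two running counts."""
    num_positives: int = 0
    num_negatives: int = 0

    def add(self, label: float) -> None:
        # Assumption for this sketch: labels above 0.5 count as positive.
        if label > 0.5:
            self.num_positives += 1
        else:
            self.num_negatives += 1


def summarize_partition(labels):
    """Reduce one partition's labels to a single counter."""
    counter = LabelCounter()
    for label in labels:
        counter.add(label)
    return counter


def collect_per_partition(partitions):
    """Simulate the collect step: the 'driver' receives one counter per partition."""
    return [summarize_partition(labels) for labels in partitions]


# Three simulated partitions of different sizes.
partitions = [[1.0, 0.0, 1.0], [0.0, 0.0], [1.0]]
per_partition = collect_per_partition(partitions)
```

In this sketch the collected list grows with the number of partitions, not with the number of records, which is why driver memory would only become a concern at a very large partition count.
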
Sorry I didn't think about this earlier, and perhaps I'm just confusing
myself now. Let me know what you think: is there an issue to solve here?
Previously, I'd have said: "With the update, this LGTM."
Also, I did think of one use case which may change things: We've been
talking about people using these methods to make plots. Do you think people
ever use them to choose thresholds? If so, then people might want much
finer-grained ROC curves than we've been thinking, and it might be worthwhile
to do a fancy implementation which avoids binning.
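
To illustrate the threshold-selection use case, here is a small local sketch in plain Python that evaluates the ROC curve at every distinct score, with no binning. It is not a distributed implementation, and the selection rule (maximizing TPR minus FPR, i.e. Youden's J) is an assumption chosen only for the example; `roc_points` and `best_threshold` are hypothetical names.

```python
def roc_points(scores_and_labels):
    """Return (threshold, fpr, tpr) for every distinct score, highest score first."""
    total_pos = sum(1 for _, label in scores_and_labels if label > 0.5)
    total_neg = len(scores_and_labels) - total_pos
    if total_pos == 0 or total_neg == 0:
        raise ValueError("need at least one positive and one negative example")

    ordered = sorted(scores_and_labels, key=lambda pair: pair[0], reverse=True)
    points = []
    tp = fp = 0
    i = 0
    while i < len(ordered):
        threshold = ordered[i][0]
        # Consume all records tied at this score so each threshold appears once.
        while i < len(ordered) and ordered[i][0] == threshold:
            if ordered[i][1] > 0.5:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((threshold, fp / total_neg, tp / total_pos))
    return points


def best_threshold(points):
    """Pick the threshold maximizing TPR - FPR (an assumed criterion)."""
    return max(points, key=lambda p: p[2] - p[1])[0]


# (score, label) pairs: four positives and four negatives.
data = [
    (0.95, 1.0), (0.9, 1.0), (0.8, 0.0), (0.7, 1.0),
    (0.6, 1.0), (0.4, 0.0), (0.3, 0.0), (0.1, 0.0),
]
points = roc_points(data)
chosen = best_threshold(points)
```

With one curve point per distinct score, the chosen threshold is always an actual score from the data. A binned curve can only offer thresholds at bin boundaries, which is the granularity concern for this use case.
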
At any rate, apologies for so much back-and-forth.