Sean Owen created SPARK-4547:
--------------------------------
Summary: OOM when making bins in BinaryClassificationMetrics
Key: SPARK-4547
URL: https://issues.apache.org/jira/browse/SPARK-4547
Project: Spark
Issue Type: Bug
Components: MLlib
Affects Versions: 1.1.0
Reporter: Sean Owen
Priority: Minor
Also following up on
http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/%3CCAMAsSdK4s4TNkf3_ecLC6yD-pLpys_PpT3WB7Tp6=yoxuxf...@mail.gmail.com%3E
-- I intend to make a PR for this one a bit later. The conversation was
basically:
{quote}
Recently I was using BinaryClassificationMetrics to build an AUC curve for a
classifier over a reasonably large number of points (~12M). The scores were all
probabilities, so tended to be almost entirely unique.
The computation does some operations by key, and this ran out of memory. It's
something you can solve with more than the default amount of memory, but in
this case, it did not seem useful to create an AUC curve with such fine-grained
resolution.
I ended up just binning the scores so there were ~1000 unique values
and then it was fine.
{quote}
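For illustration only, here is a minimal sketch of the kind of fixed-width binning described above. It is plain Python rather than Spark code, it assumes scores lie in [0, 1], and the helper names (bin_score, binned_label_counts) are hypothetical; the original message does not say how the binning was actually done.

```python
from collections import Counter
import random


def bin_score(score, num_bins=1000):
    """Map a score in [0, 1] to the lower edge of one of num_bins equal-width bins."""
    # Clamp so that a score of exactly 1.0 falls into the last bin.
    index = min(int(score * num_bins), num_bins - 1)
    return index / num_bins


def binned_label_counts(score_and_labels, num_bins=1000):
    """Count positives and negatives per bin instead of per distinct score."""
    pos, neg = Counter(), Counter()
    for score, label in score_and_labels:
        key = bin_score(score, num_bins)
        if label == 1:
            pos[key] += 1
        else:
            neg[key] += 1
    return pos, neg


random.seed(0)
# Synthetic (score, label) pairs standing in for classifier output.
data = [(random.random(), random.randint(0, 1)) for _ in range(100_000)]
pos, neg = binned_label_counts(data)
print(len({s for s, _ in data}), "distinct raw scores")
print(len(set(pos) | set(neg)), "distinct binned scores")
```

The point is that the by-key aggregation then sees at most num_bins keys, regardless of how many distinct raw scores there are.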
and:
{quote}
Yes, if there are many distinct values, we need binning to compute the AUC
curve. Usually, the scores are not evenly distributed, so we cannot simply
truncate the digits. Estimating the quantiles for binning is necessary, similar
to RangePartitioner:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/Partitioner.scala#L104
Limiting the number of bins is definitely useful.
{quote}
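As a rough sketch of the quantile idea in the reply above (not the eventual MLlib change, and not RangePartitioner's actual code), one could estimate bin boundaries from a random sample of the scores and then assign every score to a bin by binary search. Plain Python again; the function names, sample size, and bin count are assumptions chosen for illustration.

```python
import bisect
import random


def quantile_boundaries(sample, num_bins):
    """Pick up to num_bins - 1 cut points so each bin holds roughly equal sample mass."""
    ordered = sorted(sample)
    n = len(ordered)
    cuts = [ordered[(i * n) // num_bins] for i in range(1, num_bins)]
    # Deduplicate so that repeated scores do not produce empty bins.
    return sorted(set(cuts))


def assign_bin(score, boundaries):
    """Return the index of the bin that contains score."""
    return bisect.bisect_right(boundaries, score)


random.seed(0)
# Skewed synthetic scores: most of the mass is near 0, so equal-width
# bins (or truncating digits) would be badly unbalanced.
scores = [random.random() ** 4 for _ in range(100_000)]
sample = random.sample(scores, 10_000)
boundaries = quantile_boundaries(sample, num_bins=100)

counts = [0] * (len(boundaries) + 1)
for s in scores:
    counts[assign_bin(s, boundaries)] += 1
print(len(counts), "bins; smallest", min(counts), "largest", max(counts))
```

Because the boundaries come from a sample, the bins are only approximately equal in size, but their number is bounded no matter how skewed the score distribution is.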
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)