Github user srowen commented on a diff in the pull request:
https://github.com/apache/spark/pull/3702#discussion_r22193166
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala ---
@@ -28,9 +28,23 @@ import org.apache.spark.rdd.{RDD, UnionRDD}
* Evaluator for binary classification.
*
* @param scoreAndLabels an RDD of (score, label) pairs.
+ * @param numBins if greater than 0, then the curves (ROC curve, PR curve) computed internally
+ *                will be down-sampled to this many "bins". This is useful because the curve
+ *                contains a point for each distinct score in the input, and this could be as
+ *                large as the input itself -- millions of points or more, when thousands may
+ *                be entirely sufficient to summarize the curve. After down-sampling, the
+ *                curves will instead be made of approximately `numBins` points. Points are
+ *                made from bins of equal numbers of consecutive points. The size of each bin
+ *                is `floor(scoreAndLabels.count() / numBins)`, which means the resulting
+ *                number of bins may not exactly equal `numBins`. The last bin in each
+ *                partition may be smaller as a result, meaning there may be an extra sample
+ *                at partition boundaries. If `numBins` is 0, no down-sampling will occur.
*/
@Experimental
-class BinaryClassificationMetrics(scoreAndLabels: RDD[(Double, Double)]) extends Logging {
+class BinaryClassificationMetrics(
+    val scoreAndLabels: RDD[(Double, Double)],
+    val numBins: Int = 0) extends Logging {
--- End diff ---
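
For a concrete sense of the bin-size arithmetic described in the scaladoc
above, here is a minimal worked sketch; the counts are illustrative only, and
`count` stands in for `scoreAndLabels.count()`:

    // Illustrative check of the grouping arithmetic in the scaladoc above.
    object BinSizeSketch extends App {
      val count = 1000000L            // number of distinct scores in the input
      val numBins = 1000
      val grouping = count / numBins  // floor(1000000 / 1000) = 1000 points per bin
      // Each bin collapses `grouping` consecutive (score, label) points into one
      // curve point, so the down-sampled curve has roughly numBins points, plus
      // up to one extra (smaller) bin per partition at partition boundaries.
      println(s"points per bin = $grouping")
    }
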
Ah probably. What about just adding a setter?
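
For illustration, a setter-based variant might look something like the sketch
below. It assumes the original one-argument constructor is kept; the name
`setNumBins` and the mutable field are hypothetical, not the committed API:

    import org.apache.spark.Logging
    import org.apache.spark.rdd.RDD

    // Minimal sketch of the setter-based alternative floated above.
    class BinaryClassificationMetrics(
        val scoreAndLabels: RDD[(Double, Double)]) extends Logging {

      private var numBins: Int = 0

      /** Down-sample the curves to roughly this many bins; 0 (the default) disables it. */
      def setNumBins(value: Int): this.type = {
        require(value >= 0, s"numBins must be nonnegative, got $value")
        numBins = value
        this
      }
    }

A caller would then write
`new BinaryClassificationMetrics(scoreAndLabels).setNumBins(1000)`, in the
style of the chained setters used elsewhere in MLlib.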