shishaochen commented on issue #24470: [SPARK-27577][MLlib] Correct thresholds downsampled in BinaryClassificationMetrics

URL: https://github.com/apache/spark/pull/24470#issuecomment-489485003

@srowen Many thanks for your patience! I have added an explanation in the code comments at [BinaryClassificationMetrics.scala](https://github.com/apache/spark/pull/24470/commits/147f239b47e7317d9a4454820a4e099c98c536dc). Does the following wording match your expectations?

```scala
counts.mapPartitions(_.grouped(grouping.toInt).map { pairs =>
  // The score of the combined point will be just the last one's score, which is also
  // the minimum in each chunk since all scores are already sorted in descending order.
  val lastScore = pairs.last._1
  // The combined point will contain all counts in this chunk. Thus, metrics
  // calculated at its score (or so-called threshold), such as precision and recall,
  // are the same as those computed without sampling.
  val agg = new BinaryLabelCounter()
  pairs.foreach(pair => agg += pair._2)
  (lastScore, agg)
})
```

Besides, I have scanned all unit tests and class references in the Spark code repository. None of them uses `numBins` except one unit test, [BinaryClassificationMetricsSuite](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetricsSuite.scala#L172), which only tests the ROC curve without thresholds. Thus, it is safe to merge this pull request.
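To make the chunking behavior concrete, here is a minimal standalone sketch (plain Scala collections, not Spark, with hypothetical toy data) of the same idea: scores sorted in descending order are grouped into chunks, each chunk's threshold is its last (minimum) score, and its counts are summed. `BinaryLabelCounter` is replaced by a plain `Long` sum for illustration only.

```scala
// Hypothetical (score, count) pairs, sorted by descending score.
val counts = Seq((0.9, 3L), (0.8, 1L), (0.7, 2L), (0.6, 5L), (0.5, 4L))
val grouping = 2

val downsampled = counts.grouped(grouping).map { pairs =>
  // Scores are sorted descending, so the last score is the chunk's minimum.
  val lastScore = pairs.last._1
  // Sum the counts so the combined point covers the whole chunk; metrics
  // evaluated at lastScore then match those computed without downsampling.
  val agg = pairs.map(_._2).sum
  (lastScore, agg)
}.toList

println(downsampled)  // List((0.8,4), (0.6,7), (0.5,4))
```

Taking the last score rather than the first is the crux of the fix: the combined count includes every pair in the chunk, so it is only consistent with the chunk's lowest threshold.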
