shishaochen commented on issue #24470: [SPARK-27577][MLlib] Correct thresholds downsampled in BinaryClassificationMetrics

URL: https://github.com/apache/spark/pull/24470#issuecomment-489485003

@srowen Many thanks for your patience! I have added an explanation in the code comments at [BinaryClassificationMetrics.scala](https://github.com/apache/spark/pull/24470/commits/147f239b47e7317d9a4454820a4e099c98c536dc). Does the following wording match your expectations?

```scala
counts.mapPartitions(_.grouped(grouping.toInt).map { pairs =>
  // The score of the combined point will be just the last one's score, which is also
  // the minimum in each chunk since all scores are already sorted in descending order.
  val lastScore = pairs.last._1
  // The combined point will contain all counts in this chunk. Thus, metrics
  // calculated at its score (or so-called threshold), such as precision and recall,
  // are the same as those computed without sampling.
  val agg = new BinaryLabelCounter()
  pairs.foreach(pair => agg += pair._2)
  (lastScore, agg)
})
```

Besides, I have scanned all unit tests and class references in the Spark code repository. None of them uses `numBins` except one unit test, [BinaryClassificationMetricsSuite](https://github.com/apache/spark/blob/master/mllib/src/test/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetricsSuite.scala#L172), which only tests the ROC curve without thresholds. Thus, it is safe to merge this pull request.
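To make the chunking behavior concrete, here is a minimal standalone sketch (plain Scala collections, not Spark, with hypothetical toy data) of the same idea: scores sorted in descending order are grouped into chunks, each chunk's threshold is its last (minimum) score, and its counts are summed. `BinaryLabelCounter` is replaced by a plain `Long` sum for illustration only.

```scala
// Hypothetical (score, count) pairs, sorted by descending score.
val counts = Seq((0.9, 3L), (0.8, 1L), (0.7, 2L), (0.6, 5L), (0.5, 4L))
val grouping = 2

val downsampled = counts.grouped(grouping).map { pairs =>
  // Scores are sorted descending, so the last score is the chunk's minimum.
  val lastScore = pairs.last._1
  // Sum the counts so the combined point covers the whole chunk; metrics
  // evaluated at lastScore then match those computed without downsampling.
  val agg = pairs.map(_._2).sum
  (lastScore, agg)
}.toList

println(downsampled)  // List((0.8,4), (0.6,7), (0.5,4))
```

Taking the last score rather than the first is the crux of the fix: the combined count includes every pair in the chunk, so it is only consistent with the chunk's lowest threshold.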
