shishaochen commented on issue #24470: [SPARK-27577][MLlib] Correct thresholds 
downsampled in BinaryClassificationMetrics
URL: https://github.com/apache/spark/pull/24470#issuecomment-489287434
 
 
   @srowen Yes, both are approximations. But the error is smaller if we choose the last element in each chunk as the threshold.
   And the essential problem is that the so-called "downsampling" is not real sampling: the [code behind it](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/evaluation/BinaryClassificationMetrics.scala#L196) calculates the AUC from the statistics (TP, FP, TN, FN) of all elements.
   ```scala
   counts.mapPartitions(_.grouped(grouping.toInt).map { pairs =>
     // The score of the combined point will be just the first one's score
     val firstScore = pairs.head._1
     // The point will contain all counts in this chunk
     val agg = new BinaryLabelCounter()
     pairs.foreach(pair => agg += pair._2)
     (firstScore, agg)
   })
   ```
   As you can see, the counters (`BinaryLabelCounter`) of all elements in a chunk are merged into one point, rather than the first element being returned directly.
   Thus, by the definition of `threshold`, the score of the last element (which is the minimal one in the chunk) is the right threshold to use at inference time.
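   A minimal sketch of that change against the grouping code above (same identifiers as the upstream snippet; an illustration only, not necessarily the exact diff in this PR):
   ```scala
   counts.mapPartitions(_.grouped(grouping.toInt).map { pairs =>
     // Label the combined point with the chunk's last (minimal) score instead,
     // so the threshold covers every instance whose counts were merged into it
     val lastScore = pairs.last._1
     val agg = new BinaryLabelCounter()
     pairs.foreach(pair => agg += pair._2)
     (lastScore, agg)
   })
   ```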
   In online systems, we need to choose the right threshold to predict whether an instance is positive (`score >= threshold`) or negative (`score < threshold`).
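   To make this concrete, here is a self-contained sketch with a hypothetical chunk of three scored instances (plain Scala, not Spark code). Counting predicted positives under the `score >= threshold` rule agrees with the merged counter only when the chunk's last score is used:
   ```scala
   object ThresholdSketch extends App {
     // One downsampled chunk of (score, label) pairs, sorted by score descending
     val chunk = Seq((0.9, 1.0), (0.7, 1.0), (0.5, 0.0))

     // Rule applied at inference time: predict positive iff score >= threshold
     def predictedPositives(threshold: Double): Int =
       chunk.count { case (score, _) => score >= threshold }

     // The merged counter treats all 3 instances as positive at this point,
     // so only the last (minimal) score reproduces the aggregated statistics
     println(predictedPositives(0.9)) // 1 -> first score undercounts the chunk
     println(predictedPositives(0.5)) // 3 -> last score matches the merged counts
   }
   ```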
   
   
