[ 
https://issues.apache.org/jira/browse/SPARK-32472?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Moore updated SPARK-32472:
--------------------------------
    Description: 
Currently, the only thresholded metrics available from 
BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly 
through roc()) the false positive rate.

Unfortunately, you can't always compute the individual thresholded confusion 
matrix elements (TP, FP, TN, FN) from these quantities. You can make a system 
of equations out of the existing thresholded metrics and the total count, but 
the system becomes underdetermined when there are no true positives.
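
As a small illustration of the underdetermined case, the plain-Scala sketch below builds two different confusion matrices with zero true positives that agree on precision, recall, F-measure, false positive rate, and total count. The `Confusion` class, its zero-denominator conventions, and the example counts are hypothetical and chosen only for this demonstration; they are not part of Spark.

```scala
object UnderdeterminedDemo {
  // Hypothetical helper type for illustration; not Spark's BinaryConfusionMatrix.
  final case class Confusion(tp: Double, fp: Double, tn: Double, fn: Double) {
    def total: Double = tp + fp + tn + fn
    // Zero-denominator conventions below are assumptions of this sketch.
    def precision: Double = if (tp + fp == 0) 1.0 else tp / (tp + fp)
    def recall: Double = if (tp + fn == 0) 0.0 else tp / (tp + fn)
    def falsePositiveRate: Double = if (fp + tn == 0) 0.0 else fp / (fp + tn)
    def f1: Double = {
      val (p, r) = (precision, recall)
      if (p + r == 0) 0.0 else 2 * p * r / (p + r)
    }
    // Everything a caller could observe from thresholded metrics plus the total count.
    def summary: (Double, Double, Double, Double, Double) =
      (precision, recall, f1, falsePositiveRate, total)
  }

  // Two distinct matrices, both with no true positives and 10 examples in total.
  val a: Confusion = Confusion(tp = 0, fp = 2, tn = 6, fn = 2)
  val b: Confusion = Confusion(tp = 0, fp = 1, tn = 3, fn = 6)

  def main(args: Array[String]): Unit = {
    println(s"a: ${a.summary}")
    println(s"b: ${b.summary}")
    println(s"same summary, different matrices: ${a.summary == b.summary && a != b}")
  }
}
```

Because both matrices produce the same observable summary, no amount of algebra on the exposed metrics can tell them apart.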

Fortunately, the individual confusion matrix elements by threshold are already 
computed and sitting in the `confusions` variable. It would be helpful to 
expose these elements directly. The easiest way would probably be by adding 
methods like

 
{code:java}
def truePositivesByThreshold(): RDD[(Double, Double)] =
  confusions.map { case (t, c) => (t, c.weightedTruePositives) }
{code}
 

An alternative could be to expose the entire RDD[(Double, 
BinaryConfusionMatrix)] in one method, but BinaryConfusionMatrix is also 
currently package private.
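
To make concrete what per-threshold confusion matrix elements would look like to a caller, here is a minimal sketch on plain Scala collections. It is not Spark's implementation and does not use the `confusions` variable: the `ConfusionByThreshold` object, the `Counts` type, and the rule that labels above 0.5 count as positive are all assumptions made for illustration.

```scala
object ConfusionByThreshold {
  // Hypothetical result type: the four confusion matrix elements at one threshold.
  final case class Counts(tp: Double, fp: Double, tn: Double, fn: Double)

  /** For each distinct score t (in descending order), treat every example with
    * score >= t as predicted positive and count the four elements. */
  def byThreshold(scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, Counts)] = {
    // Assumption of this sketch: a label above 0.5 is a positive example.
    val totalPos = scoreAndLabels.count(_._2 > 0.5).toDouble
    val totalNeg = scoreAndLabels.size - totalPos
    val thresholds = scoreAndLabels.map(_._1).distinct.sortWith(_ > _)
    thresholds.map { t =>
      val predictedPos = scoreAndLabels.filter(_._1 >= t)
      val tp = predictedPos.count(_._2 > 0.5).toDouble
      val fp = predictedPos.size - tp
      t -> Counts(tp, fp, totalNeg - fp, totalPos - tp)
    }
  }

  def main(args: Array[String]): Unit = {
    // Made-up (score, label) pairs for demonstration only.
    val data = Seq((0.9, 1.0), (0.8, 0.0), (0.6, 1.0), (0.3, 0.0))
    byThreshold(data).foreach { case (t, c) => println(s"$t -> $c") }
  }
}
```

Returning all four elements together corresponds to the single-method alternative; separate methods per element would each project one field of `Counts`.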

The closest existing issue I found is SPARK-18844 
(https://issues.apache.org/jira/browse/SPARK-18844), which proposed adding new 
calculations to BinaryClassificationMetrics and was closed without any changes 
being merged.

  was:
Currently, the only thresholded metrics available from 
BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly 
through `roc()`) the false positive rate.

Unfortunately, you can't always compute the individual thresholded confusion 
matrix elements (TP, FP, TN, FN) from these quantities. You can make a system 
of equations out of the existing thresholded metrics and the total count, but 
they become underdetermined when there are no true positives.

Fortunately, the individual confusion matrix elements by threshold are already 
computed and sitting in the `confusions` variable. It would be helpful to 
expose these elements directly. The easiest way would probably be by adding 
methods like `def truePositivesByThreshold(): RDD[(Double, Double)] = 
confusions.map { case (t, c) => (t, c.weightedTruePositives) }`.

An alternative could be to expose the entire `RDD[(Double, 
BinaryConfusionMatrix)]` in one method, but `BinaryConfusionMatrix` is also 
currently package private.

The closest issue to this I found was this one for adding new calculations to 
BinaryClassificationMetrics https://issues.apache.org/jira/browse/SPARK-18844, 
which was closed without any changes being merged.


> Expose confusion matrix elements by threshold in BinaryClassificationMetrics
> ----------------------------------------------------------------------------
>
>                 Key: SPARK-32472
>                 URL: https://issues.apache.org/jira/browse/SPARK-32472
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 3.0.0
>            Reporter: Kevin Moore
>            Priority: Minor
>
> Currently, the only thresholded metrics available from 
> BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly 
> through roc()) the false positive rate.
> Unfortunately, you can't always compute the individual thresholded confusion 
> matrix elements (TP, FP, TN, FN) from these quantities. You can make a system 
> of equations out of the existing thresholded metrics and the total count, but 
> the system becomes underdetermined when there are no true positives.
> Fortunately, the individual confusion matrix elements by threshold are 
> already computed and sitting in the `confusions` variable. It would be 
> helpful to expose these elements directly. The easiest way would probably be 
> by adding methods like
>  
> {code:java}
> def truePositivesByThreshold(): RDD[(Double, Double)] =
>   confusions.map { case (t, c) => (t, c.weightedTruePositives) }
> {code}
>  
> An alternative could be to expose the entire RDD[(Double, 
> BinaryConfusionMatrix)] in one method, but BinaryConfusionMatrix is also 
> currently package private.
> The closest existing issue I found is SPARK-18844 
> (https://issues.apache.org/jira/browse/SPARK-18844), which proposed adding 
> new calculations to BinaryClassificationMetrics and was closed without any 
> changes being merged.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
