Kevin Moore created SPARK-32472:
-----------------------------------
Summary: Expose confusion matrix elements by threshold in
BinaryClassificationMetrics
Key: SPARK-32472
URL: https://issues.apache.org/jira/browse/SPARK-32472
Project: Spark
Issue Type: Improvement
Components: MLlib
Affects Versions: 3.0.0
Reporter: Kevin Moore
Currently, the only thresholded metrics available from
BinaryClassificationMetrics are precision, recall, f-measure, and (indirectly
through `roc()`) the false positive rate.
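Unfortunately, you can't always compute the individual thresholded confusion
matrix elements (TP, FP, TN, FN) from these quantities. You can make a system
of equations out of the existing thresholded metrics and the total count, but
the system becomes underdetermined when there are no true positives:
precision, recall, and f-measure then reveal little beyond TP = 0, leaving
essentially only the false positive rate and the total count to pin down FP,
TN, and FN.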
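To make the underdetermined case concrete, here is a minimal sketch in plain Scala (no Spark dependency). All names in it (`UnderdeterminedExample`, `Confusion`, `Metrics`, `ratio`) are hypothetical, and mapping zero denominators to 0.0 is an assumption of the sketch, not a statement about how Spark handles them. It builds two different confusion matrices with TP = 0 that produce identical precision, recall, f-measure, false positive rate, and total count.

```scala
// Plain Scala, no Spark required. Every name below is hypothetical.
object UnderdeterminedExample {
  // One thresholded confusion matrix.
  final case class Confusion(tp: Double, fp: Double, tn: Double, fn: Double)

  // The quantities recoverable today: precision, recall, f-measure,
  // false positive rate (via roc()), plus the total count.
  final case class Metrics(
      precision: Double,
      recall: Double,
      fMeasure: Double,
      falsePositiveRate: Double,
      total: Double)

  // Zero denominators map to 0.0 here; that convention is an assumption.
  private def ratio(num: Double, den: Double): Double =
    if (den == 0.0) 0.0 else num / den

  def metrics(c: Confusion): Metrics = {
    val p = ratio(c.tp, c.tp + c.fp)
    val r = ratio(c.tp, c.tp + c.fn)
    Metrics(
      precision = p,
      recall = r,
      fMeasure = ratio(2.0 * p * r, p + r),
      falsePositiveRate = ratio(c.fp, c.fp + c.tn),
      total = c.tp + c.fp + c.tn + c.fn)
  }

  // Two different matrices with TP = 0 and the same total of 100.
  val a: Confusion = Confusion(tp = 0.0, fp = 10.0, tn = 40.0, fn = 50.0)
  val b: Confusion = Confusion(tp = 0.0, fp = 5.0, tn = 20.0, fn = 75.0)

  def main(args: Array[String]): Unit = {
    println(metrics(a))
    println(metrics(b))
    // The matrices differ, yet every derived metric matches.
    println(a != b && metrics(a) == metrics(b)) // prints true
  }
}
```

Both matrices yield precision 0, recall 0, f-measure 0, a false positive rate of 0.2, and a total of 100, so nothing in the existing outputs can tell them apart.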
Fortunately, the individual confusion matrix elements by threshold are already
computed and sitting in the `confusions` variable. It would be helpful to
expose these elements directly. The easiest way would probably be by adding
methods like `def truePositivesByThreshold(): RDD[(Double, Double)] =
confusions.map { case (t, c) => (t, c.weightedTruePositives) }`.
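A rough sketch of what the full set of accessors might look like follows. It is a stand-in, not a patch: `Seq` replaces `RDD` so that it runs without Spark, and `ConfusionLike`, `Counts`, `MetricsSketch`, and `sample` are hypothetical names. Only `truePositivesByThreshold` and `weightedTruePositives` come from the text above; the other three method names and the other three `weighted*` fields are assumed to follow the same pattern.

```scala
// Self-contained stand-in: Seq replaces RDD so this runs without Spark.
object ByThresholdSketch {
  // Hypothetical mirror of the per-threshold confusion matrix. Only
  // weightedTruePositives is named in the issue; the rest are assumed.
  trait ConfusionLike {
    def weightedTruePositives: Double
    def weightedFalsePositives: Double
    def weightedTrueNegatives: Double
    def weightedFalseNegatives: Double
  }

  final case class Counts(
      weightedTruePositives: Double,
      weightedFalsePositives: Double,
      weightedTrueNegatives: Double,
      weightedFalseNegatives: Double) extends ConfusionLike

  // Stand-in for the class that owns the `confusions` variable.
  final class MetricsSketch(confusions: Seq[(Double, ConfusionLike)]) {
    def truePositivesByThreshold(): Seq[(Double, Double)] =
      confusions.map { case (t, c) => (t, c.weightedTruePositives) }

    def falsePositivesByThreshold(): Seq[(Double, Double)] =
      confusions.map { case (t, c) => (t, c.weightedFalsePositives) }

    def trueNegativesByThreshold(): Seq[(Double, Double)] =
      confusions.map { case (t, c) => (t, c.weightedTrueNegatives) }

    def falseNegativesByThreshold(): Seq[(Double, Double)] =
      confusions.map { case (t, c) => (t, c.weightedFalseNegatives) }
  }

  // Made-up counts at two thresholds, purely for illustration.
  val sample: MetricsSketch = new MetricsSketch(Seq(
    0.9 -> Counts(1.0, 0.0, 3.0, 2.0),
    0.5 -> Counts(2.0, 1.0, 2.0, 1.0)))

  def main(args: Array[String]): Unit = {
    println(sample.truePositivesByThreshold())
    println(sample.falseNegativesByThreshold())
  }
}
```

Each accessor is a single `map` over data that already exists, so exposing the elements this way should not require any new computation.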
An alternative could be to expose the entire `RDD[(Double,
BinaryConfusionMatrix)]` in one method, but `BinaryConfusionMatrix` is also
currently package private.
The closest existing issue I found is SPARK-18844
(https://issues.apache.org/jira/browse/SPARK-18844), which proposed adding new
calculations to BinaryClassificationMetrics and was closed without any changes
being merged.