[
https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136639#comment-16136639
]
Sean Owen commented on SPARK-21806:
-----------------------------------
Agree, I think the goal has been to stay consistent with commonly-used
libraries where the right behavior is ambiguous.
I think it's a smaller change to keep a point for recall = 0 in all cases but
give it a better value. I'd support that, as right now the result potentially
has two values for recall = 0, which is already wrong.
If that's consistent with something in R, good. Ignoring the curve from [0,
min(recall)] sounds like a bigger change, so I'd hesitate to adopt that even if
scikit-learn does.
> BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
> ----------------------------------------------------------------------
>
> Key: SPARK-21806
> URL: https://issues.apache.org/jira/browse/SPARK-21806
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.2.0
> Reporter: Marc Kaminski
> Priority: Minor
> Attachments: PRROC_example.jpeg
>
>
> I would like to reference a [discussion in scikit-learn|
> https://github.com/scikit-learn/scikit-learn/issues/4223], as this behavior
> is probably based on the scikit-learn implementation.
> Summary:
> Currently, the y-axis intercept of the precision-recall curve is set to (0.0,
> 1.0). This behavior is not ideal in certain edge cases (see example below)
> and can also affect cross validation when the optimization metric is
> set to "areaUnderPR".
> Please consider [blucena's
> post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613]
> for possible alternatives.
> Edge case example:
> Consider a bad classifier that assigns a high probability to all samples. A
> possible output might look like this:
> ||Real label || Score ||
> |1.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 0.95 |
> |0.0 | 0.95 |
> |1.0 | 1.0 |
> This results in the following PR points (first line set by default):
> ||Threshold || Recall || Precision ||
> |1.0 | 0.0 | 1.0 |
> |0.95 | 1.0 | 0.2 |
> |0.0 | 1.0 | 0.16 |
> The auPRC would be around 0.6. Classifiers with a more differentiated
> probability assignment will be falsely assumed to perform worse with respect
> to auPRC.
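To make the effect concrete, here is a small illustrative sketch (plain Python with trapezoidal integration, not Spark's actual implementation) of the area under the PR points from the example above, with and without the default (0.0, 1.0) anchor point. The point coordinates are taken from the tables in the issue; `trapezoid_auc` is a hypothetical helper name.

```python
def trapezoid_auc(points):
    """Area under a curve given as (x, y) points, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# PR points from the example, as (recall, precision) pairs.
# 2 positives among 12 samples, so precision at threshold 0 is 2/12.
with_anchor = [(0.0, 1.0), (1.0, 0.2), (1.0, 2.0 / 12.0)]
without_anchor = with_anchor[1:]

print(trapezoid_auc(with_anchor))     # 0.6 -- inflated by the default anchor
print(trapezoid_auc(without_anchor))  # 0.0 -- the remaining points all sit at recall = 1
```

The entire 0.6 comes from the segment between the artificial (0.0, 1.0) point and the first real point, which is why a classifier that never separates the classes can still score a sizable auPRC.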
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]