[
https://issues.apache.org/jira/browse/SPARK-21806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16136639#comment-16136639
]
Sean Owen commented on SPARK-21806:
-----------------------------------
Agree, I think the goal has been to stay consistent with commonly-used
libraries where the right behavior is ambiguous.
I think it's a smaller change to keep a point for recall = 0 in all cases but
give it a better value. I'd support that, as right now the result potentially
has two values for recall = 0, which is already wrong.
If that's consistent with something in R, good. Ignoring the curve from [0,
min(recall)] sounds like a bigger change, so I'd hesitate to adopt that even if
scikit-learn does.
> BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
> ----------------------------------------------------------------------
>
> Key: SPARK-21806
> URL: https://issues.apache.org/jira/browse/SPARK-21806
> Project: Spark
> Issue Type: Improvement
> Components: MLlib
> Affects Versions: 2.2.0
> Reporter: Marc Kaminski
> Priority: Minor
> Attachments: PRROC_example.jpeg
>
>
> I would like to reference a [discussion in scikit-learn|
> https://github.com/scikit-learn/scikit-learn/issues/4223], as this behavior
> is probably based on the scikit-learn implementation.
> Summary:
> Currently, the y-axis intercept of the precision-recall curve is set to (0.0,
> 1.0). This behavior is not ideal in certain edge cases (see example below)
> and can also affect cross validation when the optimization metric is
> set to "areaUnderPR".
> Please consider [blucena's
> post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613]
> for possible alternatives.
> Edge case example:
> Consider a bad classifier that assigns a high probability to all samples. A
> possible output might look like this:
> ||Real label || Score ||
> |1.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 1.0 |
> |0.0 | 0.95 |
> |0.0 | 0.95 |
> |1.0 | 1.0 |
> This results in the following PR points (first line set by default):
> ||Threshold || Recall || Precision ||
> |1.0 | 0.0 | 1.0 |
> |0.95 | 1.0 | 0.2 |
> |0.0 | 1.0 | 0.16 |
> The auPRC would be around 0.6. Classifiers with a more differentiated
> probability assignment will be falsely assumed to perform worse with respect
> to auPRC.
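To make the effect concrete, here is a small illustrative sketch (plain Python with trapezoidal integration, not Spark's actual implementation) of the area under the PR points from the example above, with and without the default (0.0, 1.0) anchor point. The point coordinates are taken from the tables in the issue; `trapezoid_auc` is a hypothetical helper name.

```python
def trapezoid_auc(points):
    """Area under a curve given as (x, y) points, by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# PR points from the example, as (recall, precision) pairs.
# 2 positives among 12 samples, so precision at threshold 0 is 2/12.
with_anchor = [(0.0, 1.0), (1.0, 0.2), (1.0, 2.0 / 12.0)]
without_anchor = with_anchor[1:]

print(trapezoid_auc(with_anchor))     # 0.6 -- inflated by the default anchor
print(trapezoid_auc(without_anchor))  # 0.0 -- the remaining points all sit at recall = 1
```

The entire 0.6 comes from the segment between the artificial (0.0, 1.0) point and the first real point, which is why a classifier that never separates the classes can still score a sizable auPRC.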
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]