Marc Kaminski created SPARK-21806:
-------------------------------------

             Summary: BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading
                 Key: SPARK-21806
                 URL: https://issues.apache.org/jira/browse/SPARK-21806
             Project: Spark
          Issue Type: Improvement
          Components: MLlib
    Affects Versions: 2.2.0
            Reporter: Marc Kaminski
            Priority: Minor


I would like to refer to a [discussion in 
scikit-learn|https://github.com/scikit-learn/scikit-learn/issues/4223], as this 
behavior is probably based on the scikit-learn implementation. 

Summary: 
Currently, the first point of the precision-recall curve (its y-axis intercept) 
is fixed at (0.0, 1.0). This is misleading in certain edge cases (see the 
example below) and can also affect cross-validation when the optimization 
metric is set to "areaUnderPR". 

Please consider [blucena's 
post|https://github.com/scikit-learn/scikit-learn/issues/4223#issuecomment-215273613] 
for possible alternatives. 

Edge case example: 
Consider a bad classifier that assigns a high probability to all samples. A 
possible output might look like this: 

||Real label || Score ||
|1.0 | 1.0 |
|1.0 | 1.0 |
|1.0 | 1.0 |
|1.0 | 1.0 |
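
To make the effect on the area concrete, here is a minimal sketch in plain 
Python. It is not Spark's implementation. The data set (one positive and three 
negatives, all scored 1.0) is hypothetical and is not the table above, the 
function names pr_points and area_under are made up for illustration, and the 
area is assumed to be computed with the trapezoidal rule. The "first" option, 
which reuses the precision of the first threshold at recall 0.0, is only one 
conceivable alternative and is not necessarily what the linked post proposes. 

```python
def pr_points(labels, scores, first_point="fixed"):
    """Return (recall, precision) points, one per distinct score threshold.

    first_point="fixed" prepends (0.0, 1.0), the behavior described in this issue.
    first_point="first" prepends (0.0, precision at the first threshold), a
    hypothetical alternative used here only for comparison.
    """
    total_pos = sum(1 for y in labels if y == 1.0)
    if total_pos == 0:
        raise ValueError("at least one positive label is required")

    points = []
    # Walk the thresholds from the highest score to the lowest.
    for threshold in sorted(set(scores), reverse=True):
        predicted = [y for y, s in zip(labels, scores) if s >= threshold]
        true_pos = sum(1 for y in predicted if y == 1.0)
        points.append((true_pos / total_pos, true_pos / len(predicted)))

    start_precision = 1.0 if first_point == "fixed" else points[0][1]
    return [(0.0, start_precision)] + points


def area_under(points):
    """Trapezoidal area under a list of (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Hypothetical data: every sample gets score 1.0, only one is truly positive.
labels = [1.0, 0.0, 0.0, 0.0]
scores = [1.0, 1.0, 1.0, 1.0]

fixed = pr_points(labels, scores, first_point="fixed")
first = pr_points(labels, scores, first_point="first")

print(fixed, area_under(fixed))  # [(0.0, 1.0), (1.0, 0.25)] 0.625
print(first, area_under(first))  # [(0.0, 0.25), (1.0, 0.25)] 0.25
```

In this sketch the classifier never achieves a precision above 0.25, yet the 
fixed (0.0, 1.0) starting point lifts the area from 0.25 to 0.625. 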

 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
