[ 
https://issues.apache.org/jira/browse/SPARK-24431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503466#comment-16503466
 ] 

Sean Owen commented on SPARK-24431:
-----------------------------------

So, the model makes the same prediction p for every example? In that case, for 
points on the PR curve corresponding to thresholds less than p, everything is 
classified positive: recall is 1 and precision is the fraction of examples that 
are actually positive. For thresholds greater than p, everything is classified 
negative: recall is 0, but precision is really undefined (0/0). Yes, see 
SPARK-21806 for a discussion of the different ways you could try to handle 
this: no point for recall = 0, or precision = 0, 1, or p.
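To make the trade-off concrete, here is a minimal plain-Python sketch (with 
hypothetical numbers: a constant prediction p = 0.1 and a 10% event rate) of 
the trapezoidal area you get under each convention for the recall = 0 point. 
This is an illustration of the arithmetic, not Spark's actual implementation:

```python
# A degenerate model predicts the same score for every example.
# The PR "curve" then has only two usable points.
n_pos, n_neg = 10, 90
event_rate = n_pos / (n_pos + n_neg)  # 0.1

# Threshold below the constant score: everything predicted positive,
# so recall = 1 and precision = event_rate.
# Threshold above it: everything predicted negative, so recall = 0
# and precision is 0/0 -- we must pick a convention for that point.

def au_pr(precision_at_zero_recall):
    # Trapezoidal area between (recall=0, chosen precision)
    # and (recall=1, event_rate).
    return 0.5 * (precision_at_zero_recall + event_rate) * 1.0

print(au_pr(1.0))         # old behaviour: inflated area for the null model
print(au_pr(0.0))         # precision = 0 convention: smallest possible area
print(au_pr(event_rate))  # precision = p convention (per SPARK-21806)
```

With a low event rate, the precision = 1 convention yields an area near 0.5, 
which is exactly how a null model can win a cross-validation comparison.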

We (mostly I) settled on the last option, precision = p, as the least 
surprising change and the one most likely to produce model comparisons that 
make intuitive sense. My argument against precision = 0 at recall = 0 is that 
it doesn't quite make sense for precision to drop as recall decreases, and 
that convention would pin precision at its smallest possible value.

You're right that this is a corner case, and there is no really correct way to 
handle it. I'd say the issue here is perhaps leaning too much on the area under 
the PR curve. It doesn't have as much meaning as the area under the ROC curve. 
I think it's just the extreme example of the problem with comparing precision 
and recall?

> wrong areaUnderPR calculation in BinaryClassificationEvaluator 
> ---------------------------------------------------------------
>
>                 Key: SPARK-24431
>                 URL: https://issues.apache.org/jira/browse/SPARK-24431
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.2.0
>            Reporter: Xinyong Tian
>            Priority: Major
>
> My problem: I am using CrossValidator(estimator=LogisticRegression(...), ..., 
>  evaluator=BinaryClassificationEvaluator(metricName='areaUnderPR')) to 
> select the best model. When the regParam in logistic regression is very high, 
> no variable is selected (no model), i.e. every row's prediction is the same, 
> e.g. equal to the event rate (baseline frequency). But at this point, 
> BinaryClassificationEvaluator gives areaUnderPR its highest value. As a 
> result, the best model selected is a null model. 
> The reason is the following: with no model, the precision-recall curve has 
> only two points. At recall = 0, precision should be set to zero, while the 
> software sets it to 1. At recall = 1, precision equals the event rate. As a 
> result, areaUnderPR will be close to 0.5 (my event rate is very low), which 
> is the maximum. 
> The solution is to set precision = 0 when recall = 0.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
