[ 
https://issues.apache.org/jira/browse/SPARK-27577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-27577.
-------------------------------
       Resolution: Fixed
         Assignee: Shaochen Shi
    Fix Version/s: 2.4.4
                   3.0.0
                   2.3.4

Resolved by https://github.com/apache/spark/pull/24470

> Wrong thresholds selected by BinaryClassificationMetrics when downsampling
> --------------------------------------------------------------------------
>
>                 Key: SPARK-27577
>                 URL: https://issues.apache.org/jira/browse/SPARK-27577
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0
>            Reporter: Shaochen Shi
>            Assignee: Shaochen Shi
>            Priority: Minor
>             Fix For: 2.3.4, 3.0.0, 2.4.4
>
>
> In binary metrics, a threshold means that any instance with a score >= threshold
> is considered positive.
> However, in the existing implementation:
>  # When `numBins` is set when creating a `BinaryClassificationMetrics`
> object, all records (ordered by score, descending) are grouped into chunks.
>  # In each chunk, the statistics of its records (in `BinaryLabelCounter`) are
> accumulated, while the first record's score (also the largest) is selected as
> the threshold.
>  # These generated/sampled records form a new, smaller data set that is used
> to calculate the binary metrics.
> The second step introduces the bug: a record's score/threshold is paired with
> the wrong counts (a larger `true positive`, a smaller `false negative`) when
> calculating `recallByThreshold`, `precisionByThreshold`, etc.
> The fix is therefore straightforward: pick the last record's score in each
> chunk as the threshold when the statistics are merged.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

