[
https://issues.apache.org/jira/browse/SPARK-27577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Shaochen Shi updated SPARK-27577:
---------------------------------
Affects Version/s: 3.0.0
> Wrong thresholds selected by BinaryClassificationMetrics when downsampling
> --------------------------------------------------------------------------
>
> Key: SPARK-27577
> URL: https://issues.apache.org/jira/browse/SPARK-27577
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.3.0, 1.3.1, 1.4.0, 1.4.1, 1.5.0, 1.5.1, 1.5.2, 1.6.0,
> 1.6.1, 1.6.2, 1.6.3, 2.0.0, 2.0.1, 2.0.2, 2.1.0, 2.1.1, 2.1.2, 2.1.3, 2.2.0,
> 2.2.1, 2.2.2, 2.2.3, 2.3.0, 2.3.1, 2.3.2, 2.3.3, 2.4.0, 2.4.1, 2.4.2, 3.0.0
> Reporter: Shaochen Shi
> Priority: Critical
> Labels: Correctness
>
> In binary metrics, a threshold means that any instance with a score >=
> threshold is considered positive.
> However, in the existing implementation:
> # When `numBins` is set when creating a `BinaryClassificationMetrics`
> object, all records (sorted by score in descending order) are grouped into
> chunks.
> # In each chunk, the statistics of the records (in `BinaryLabelCounter`) are
> accumulated, while the first record's score is selected as the threshold.
> # The resulting sampled records form a new, smaller data set that is used to
> calculate the binary metrics.
> The second step introduces the bug: each chunk's threshold (its highest
> score) is paired with counts accumulated over the whole chunk, including
> records whose scores are below that threshold. This yields wrong values, such
> as a larger `true positive` count and a smaller `false negative` count, when
> calculating `recallByThresholds`, `precisionByThresholds`, etc.
> The fix is therefore straightforward: pick the last record's score in each
> chunk as the threshold when the statistics are merged.
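
The effect described above can be reproduced with a small standalone sketch. This is plain Python, not Spark's code: the helper names `downsample` and `exact_counts`, the `pick_last` flag, and the sample data are all hypothetical, and the scores are assumed to be distinct. It compares the cumulative counts reported for each chunk's threshold with the counts obtained by applying `score >= threshold` directly.

```python
from typing import List, Tuple

Record = Tuple[float, int]  # (score, label), label is 1 for positive, 0 for negative


def downsample(records: List[Record], chunk_size: int,
               pick_last: bool) -> List[Tuple[float, int, int]]:
    """Return (threshold, cumulative TP, cumulative FP) for each chunk.

    Records are sorted by score in descending order and cut into chunks.
    pick_last=False mimics the reported behavior (first score of the chunk);
    pick_last=True mimics the proposed fix (last score of the chunk).
    """
    ordered = sorted(records, key=lambda r: r[0], reverse=True)
    out = []
    tp = fp = 0
    for start in range(0, len(ordered), chunk_size):
        chunk = ordered[start:start + chunk_size]
        # Counts are accumulated over the whole chunk.
        tp += sum(1 for _, label in chunk if label == 1)
        fp += sum(1 for _, label in chunk if label == 0)
        threshold = chunk[-1][0] if pick_last else chunk[0][0]
        out.append((threshold, tp, fp))
    return out


def exact_counts(records: List[Record], threshold: float) -> Tuple[int, int]:
    """TP and FP when every record with score >= threshold is predicted positive."""
    tp = sum(1 for s, label in records if s >= threshold and label == 1)
    fp = sum(1 for s, label in records if s >= threshold and label == 0)
    return tp, fp


data = [(0.9, 1), (0.8, 0), (0.7, 1), (0.6, 1), (0.5, 0), (0.4, 0)]

for pick_last in (False, True):
    print("last score per chunk" if pick_last else "first score per chunk")
    for threshold, tp, fp in downsample(data, chunk_size=2, pick_last=pick_last):
        print(f"  threshold={threshold} reported=({tp}, {fp}) "
              f"exact={exact_counts(data, threshold)}")
```

With the first score as threshold, the second chunk reports 3 true positives at threshold 0.7, while only 2 positives actually have a score >= 0.7. With the last score as threshold, the reported and exact counts agree for every chunk in this example.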