[
https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
caican updated SPARK-40455:
---------------------------
Description:
Here's a very serious bug:
When a result stage fails due to a `FetchFailedException`, the previous
condition used to decide whether result stage retries are allowed is
`numMissingPartitions < resultStage.numTasks`.
If this condition holds on retry, but the other tasks in the current result
stage are not killed, then when the result stage is resubmitted it can pick
the wrong partitions to recompute:
```
// DAGScheduler#submitMissingTasks
// Figure out the indexes of partition ids to compute.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
```
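A minimal Scala sketch of the scenario, under stated assumptions: `StageState`
and its members are hypothetical illustrative names, not Spark's actual
DAGScheduler classes, and it only models the retry condition and the
missing-partition computation described above.

```scala
// Hypothetical minimal model of the scenario; StageState and its members
// are illustrative names, not Spark's actual DAGScheduler classes.
object ResultStageRetrySketch {
  final case class StageState(numTasks: Int, finished: Set[Int]) {
    // Analogue of Stage#findMissingPartitions: partitions with no
    // successful task yet.
    def findMissingPartitions(): Seq[Int] =
      (0 until numTasks).filterNot(finished.contains)

    def numMissingPartitions: Int = numTasks - finished.size

    // The pre-fix retry condition described in this issue.
    def retryAllowed: Boolean = numMissingPartitions < numTasks
  }

  def main(args: Array[String]): Unit = {
    // A 4-task result stage where tasks 0 and 1 already succeeded when a
    // FetchFailedException fails the stage attempt.
    val stage = StageState(numTasks = 4, finished = Set(0, 1))
    assert(stage.retryAllowed) // 2 < 4, so the stage is resubmitted

    // The retry recomputes only the "missing" partitions 2 and 3, while
    // the corresponding tasks from the failed attempt may still be
    // running and racing with the new attempt over the same partitions.
    println(stage.findMissingPartitions())
  }
}
```

The sketch shows why the condition alone is insufficient: it allows a
resubmission while unkilled tasks of the old attempt can still complete
partitions concurrently with the new attempt.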
was:
Here's a very serious bug:
> Abort result stage directly when it failed caused by FetchFailed
> ----------------------------------------------------------------
>
> Key: SPARK-40455
> URL: https://issues.apache.org/jira/browse/SPARK-40455
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0, 3.1.2, 3.2.1, 3.3.0
> Reporter: caican
> Priority: Major
>
> Here's a very serious bug:
> When a result stage fails due to a `FetchFailedException`, the previous
> condition used to decide whether result stage retries are allowed is
> `numMissingPartitions < resultStage.numTasks`.
>
> If this condition holds on retry, but the other tasks in the current result
> stage are not killed, then when the result stage is resubmitted it can pick
> the wrong partitions to recompute:
> ```
> // DAGScheduler#submitMissingTasks
>
> // Figure out the indexes of partition ids to compute.
> val partitionsToCompute: Seq[Int] = stage.findMissingPartitions()
> ```
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]