[
https://issues.apache.org/jira/browse/SPARK-40455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
caican updated SPARK-40455:
---------------------------
Description:
Here's a very serious bug:
When result stage failed caused by FetchFailedException, the previous
condition to determine whether result stage retries are allowed is
{color:#ff0000}numMissingPartitions < resultStage.numTasks{color}.
If this condition holds on retry, but the other tasks at the current result
stage are not killed, when result stage was resubmit, it would got wrong
partitions to recalculation.
{code:java}
// DAGScheduler#submitMissingTasks
// Figure out the indexes of partition ids to compute.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code}
It is possible that the number of partitions to be recalculated is smaller than
the actual number of partitions at result stage
was:
Here's a very serious bug:
When result stage failed caused by FetchFailedException, the previous
condition to determine whether result stage retries are allowed is
{color:#FF0000}numMissingPartitions < resultStage.numTasks{color}.
If this condition holds on retry, but the other tasks in the current result
stage are not killed, when result stage was resubmit, it would got wrong
partitions to recalculation.
{code:java}
// DAGScheduler#submitMissingTasks
// Figure out the indexes of partition ids to compute.
val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code}
> Abort result stage directly when it failed caused by FetchFailed
> ----------------------------------------------------------------
>
> Key: SPARK-40455
> URL: https://issues.apache.org/jira/browse/SPARK-40455
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 3.0.0, 3.1.2, 3.2.1, 3.3.0
> Reporter: caican
> Priority: Major
>
> Here's a very serious bug:
> When result stage failed caused by FetchFailedException, the previous
> condition to determine whether result stage retries are allowed is
> {color:#ff0000}numMissingPartitions < resultStage.numTasks{color}.
>
> If this condition holds on retry, but the other tasks at the current result
> stage are not killed, when result stage was resubmit, it would got wrong
> partitions to recalculation.
> {code:java}
> // DAGScheduler#submitMissingTasks
>
> // Figure out the indexes of partition ids to compute.
> val partitionsToCompute: Seq[Int] = stage.findMissingPartitions() {code}
> It is possible that the number of partitions to be recalculated is smaller
> than the actual number of partitions at result stage
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]