[ https://issues.apache.org/jira/browse/SPARK-24375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16503647#comment-16503647 ]
Jiang Xingbo commented on SPARK-24375:
--------------------------------------
The major problem is that tasks in the same stage of an MPI workload may rely on
the intermediate results of fellow tasks running in parallel to compute the
final results. Thus, when a task fails, other tasks in the same stage may
generate incorrect results or even hang, so it seems straightforward to just
retry the whole stage on task failure.
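
To make the failure mode concrete, here is a minimal sketch in plain Python
threads. It is not Spark code and not the proposed design. The names
run_stage and allreduce_task are hypothetical, and threading.Barrier stands
in for whatever synchronization the MPI workload uses. It shows peers that
would hang on a failed task, and a driver that reacts by retrying the whole
stage rather than the single task.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def run_stage(num_tasks, task_fn, max_attempts=3):
    """Launch all tasks of a stage together; retry the whole stage on any failure."""
    for attempt in range(1, max_attempts + 1):
        barrier = threading.Barrier(num_tasks)
        shared = [None] * num_tasks  # intermediate results visible to peer tasks

        def wrapped(task_id):
            try:
                return task_fn(task_id, attempt, barrier, shared)
            except Exception:
                # Break the barrier so peers blocked in wait() fail fast
                # instead of hanging forever on a task that will never arrive.
                barrier.abort()
                raise

        # One worker per task: every task must run at the same time.
        with ThreadPoolExecutor(max_workers=num_tasks) as pool:
            futures = [pool.submit(wrapped, i) for i in range(num_tasks)]
            try:
                return [f.result() for f in futures], attempt
            except Exception:
                continue  # discard partial results and rerun every task
    raise RuntimeError("stage failed after %d attempts" % max_attempts)


def allreduce_task(task_id, attempt, barrier, shared):
    """Toy task: publish a partial value, sync with peers, then sum all values."""
    if task_id == 1 and attempt == 1:
        raise ValueError("simulated task failure on the first attempt")
    shared[task_id] = task_id + 1  # local partial result
    barrier.wait()                 # wait until every peer has published
    return sum(shared)             # final result depends on all peers
```

Calling run_stage(4, allreduce_task) fails on the first attempt because task 1
raises, which breaks the barrier for the other three tasks. The second attempt
reruns all four tasks and each one returns the same sum. Retrying only task 1
would not work here, since its peers have already given up on the barrier.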
> Design sketch: support barrier scheduling in Apache Spark
> ---------------------------------------------------------
>
> Key: SPARK-24375
> URL: https://issues.apache.org/jira/browse/SPARK-24375
> Project: Spark
> Issue Type: Story
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Xiangrui Meng
> Assignee: Jiang Xingbo
> Priority: Major
>
> This task is to outline a design sketch for the barrier scheduling SPIP
> discussion. It doesn't need to be a complete design before the vote. But it
> should at least cover both Scala/Java and PySpark.