zhuangxian created SPARK-49762:
----------------------------------
Summary: How to handle Task Timeouts and Placeholder Allocation
in the Spark Shuffle Write Phase
Key: SPARK-49762
URL: https://issues.apache.org/jira/browse/SPARK-49762
Project: Spark
Issue Type: Wish
Components: Spark Core
Affects Versions: 3.5.1
Reporter: zhuangxian
During the Spark shuffle write phase, the driver launches a task to write a
partition and allocates a placeholder for that task's commit. With a large
volume of data, however, the task may fail to complete the commit because of
network issues or disk failures. In that case, how should the driver detect
the task timeout and launch a new task to commit the same partition?

Starting a new task also raises the following issues:

1. The placeholder is still occupied by the old task, so the new task cannot
   obtain it for its own commit. How should a placeholder be allocated to the
   new task?
2. How can the old task exit safely, so that it does not commit the same data
   as the new task?

The commit protocol is two-phase commit (2PC). The main flow is
placeholder -> move -> commit. The specific implementation is here:
[github.com/apache/spark/blob/master/core/src/main/scala/org/…|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala]

The commit algorithm I am using is v2.

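
To make the two questions concrete, here is a minimal sketch of the scenario,
written in Python for brevity. It is a toy model and not Spark's
implementation: the names ToyCommitCoordinator, can_commit and attempt_failed
are hypothetical. It assumes the placeholder is a per-partition record of the
single attempt allowed to commit, and that the driver releases it once it
marks that attempt as failed or timed out.

```python
from typing import Dict, Optional


class ToyCommitCoordinator:
    """Toy driver-side commit arbiter (hypothetical, not Spark code)."""

    def __init__(self) -> None:
        # partition id -> attempt number currently holding the placeholder
        self._holder: Dict[int, int] = {}

    def can_commit(self, partition: int, attempt: int) -> bool:
        """Grant the placeholder to the first attempt that asks; deny others."""
        holder = self._holder.setdefault(partition, attempt)
        return holder == attempt

    def attempt_failed(self, partition: int, attempt: int) -> None:
        """Release the placeholder if the failed or timed-out attempt held it."""
        if self._holder.get(partition) == attempt:
            del self._holder[partition]

    def holder(self, partition: int) -> Optional[int]:
        """Return the attempt holding the placeholder, or None if it is free."""
        return self._holder.get(partition)


coordinator = ToyCommitCoordinator()
first = coordinator.can_commit(partition=7, attempt=0)    # old task takes the placeholder
blocked = coordinator.can_commit(partition=7, attempt=1)  # new task is refused
coordinator.attempt_failed(partition=7, attempt=0)        # driver declares the old task failed
retry = coordinator.can_commit(partition=7, attempt=1)    # new task now gets the placeholder
stale = coordinator.can_commit(partition=7, attempt=0)    # old task is refused from now on
print(first, blocked, retry, stale)
```

In this model, question 1 comes down to when the driver calls attempt_failed
for the old attempt, and question 2 comes down to the old attempt asking
can_commit again before its final step and stopping when it is refused.
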
I tried searching in the history but could not find a solution to this problem.
I look forward to discussing this issue with community members.