zhuangxian created SPARK-49762:
----------------------------------

             Summary: How to handle Task Timeouts and Placeholder Allocation 
in Spark Shuffle Write Phase
                 Key: SPARK-49762
                 URL: https://issues.apache.org/jira/browse/SPARK-49762
             Project: Spark
          Issue Type: Wish
          Components: Spark Core
    Affects Versions: 3.5.1
            Reporter: zhuangxian


During the Spark shuffle write phase, the driver launches a task to write a 
partition and allocates a commit placeholder to that task. However, when the 
data volume is large, the task may fail to complete its commit because of 
network issues or disk failures. In such cases, how should the driver detect 
the task timeout and launch a new task to commit the same partition? 
Starting a new task also raises the following issues:

1. Since the placeholder is occupied by the old task, the new task cannot 
   obtain the placeholder for its commit. How should the new task be 
   allocated a placeholder?
2. How can the old task exit safely so that it does not commit the same data 
   as the new task?

The commit protocol is two-phase commit (2PC). The main process is 
placeholder->move->commit; the specific implementation is here: 
[github.com/apache/spark/blob/master/core/src/main/scala/org/…|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala]
The commit algorithm I use is v2.
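
To make the question concrete, here is a minimal Python sketch of the 
behavior I am asking about. It is not Spark's implementation: the class 
CommitCoordinator, its method names, and the timeout_s parameter are all 
hypothetical, and the sketch assumes the driver can decide that an attempt 
has timed out purely from the time at which the placeholder was granted.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class CommitCoordinator:
    """Hypothetical driver-side model: one commit placeholder per partition."""

    timeout_s: float
    # Injectable clock so the timeout logic can be exercised without sleeping.
    clock: Callable[[], float] = time.monotonic
    # partition -> (attempt holding the placeholder, time it was granted)
    holders: Dict[int, Tuple[int, float]] = field(default_factory=dict)
    # partition -> attempt whose output was committed
    committed: Dict[int, int] = field(default_factory=dict)

    def can_commit(self, partition: int, attempt: int) -> bool:
        """Grant the placeholder to the first attempt that asks; deny others."""
        if partition in self.committed:
            return False
        holder = self.holders.get(partition)
        if holder is None:
            self.holders[partition] = (attempt, self.clock())
            return True
        # Re-asking is allowed for the current holder only.
        return holder[0] == attempt

    def expire_stale(self) -> List[int]:
        """Release placeholders held longer than timeout_s (assumed policy)."""
        now = self.clock()
        expired = [p for p, (_, granted) in self.holders.items()
                   if now - granted > self.timeout_s]
        for p in expired:
            del self.holders[p]
        return expired

    def finish_commit(self, partition: int, attempt: int) -> bool:
        """Final step: succeeds only if the attempt still holds the placeholder."""
        holder = self.holders.get(partition)
        if holder is None or holder[0] != attempt:
            # A stale attempt lands here and should abort instead of committing.
            return False
        self.committed[partition] = attempt
        del self.holders[partition]
        return True
```

In this model, a new attempt obtains the placeholder once expire_stale() has 
released it, and the old attempt is rejected by finish_commit() because it no 
longer holds the placeholder. The sketch does not cover files that the old 
attempt has already moved before it is rejected, which is the part I am least 
sure about with the v2 algorithm.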

I searched the history but could not find a solution to this problem. 
I look forward to discussing this issue with community members.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

