Github user vanzin commented on a diff in the pull request:
https://github.com/apache/spark/pull/21577#discussion_r196247811
--- Diff: core/src/main/scala/org/apache/spark/scheduler/OutputCommitCoordinator.scala ---
@@ -109,20 +116,21 @@ private[spark] class OutputCommitCoordinator(conf: SparkConf, isDriver: Boolean)
    * @param maxPartitionId the maximum partition id that could appear in this stage's tasks (i.e.
    *                       the maximum possible value of `context.partitionId`).
    */
-  private[scheduler] def stageStart(stage: StageId, maxPartitionId: Int): Unit = synchronized {
+  private[scheduler] def stageStart(stage: Int, maxPartitionId: Int): Unit = synchronized {
     stageStates(stage) = new StageState(maxPartitionId + 1)
--- End diff ---
I don't think the semantics are changing. It's always been racy: either of the concurrent tasks from different stage attempts may succeed first. And I'm almost sure the assumption is that both task attempts are equivalent (i.e. the output is deterministic, or at least should be), so it should be fine for either one to be committed.

The problem is that, without this change, the coordinator would allow both attempts to commit, and that is kinda bad.
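To make the invariant concrete, here is a minimal sketch of the "only one attempt per partition gets to commit" rule being discussed. This is not Spark's actual OutputCommitCoordinator; the object name, map layout, and `canCommit` signature below are purely illustrative:

```scala
import scala.collection.mutable

// Illustrative sketch: for a given (stage, partition), the first attempt that
// asks is authorized to commit; every later attempt -- including ones from
// other stage attempts -- is denied, even though either output would be
// acceptable if the task is deterministic.
object CommitCoordinatorSketch {
  // (stageId, partitionId) -> attempt number that won the commit race
  private val authorized = mutable.Map.empty[(Int, Int), Int]

  def canCommit(stage: Int, partition: Int, attempt: Int): Boolean = synchronized {
    authorized.get((stage, partition)) match {
      case Some(winner) => winner == attempt // already decided; only the winner may commit
      case None =>
        authorized((stage, partition)) = attempt // first asker wins
        true
    }
  }

  def main(args: Array[String]): Unit = {
    println(canCommit(stage = 0, partition = 3, attempt = 0)) // true: first attempt is authorized
    println(canCommit(stage = 0, partition = 3, attempt = 1)) // false: the race is already decided
  }
}
```

Without a check like this, two concurrent attempts for the same partition could both be told "yes" and both commit, which is exactly the situation the change avoids.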
---