otterc opened a new pull request, #36293:
URL: https://github.com/apache/spark/pull/36293

   ### What changes were proposed in this pull request?
   This change fixes scenarios in which, with push-based shuffle enabled, a 
stage re-attempt never completes successfully even though all of its tasks 
complete. With Adaptive Merge Finalization, a stage may be triggered for 
finalization when it is in the following state:
   - The stage is not running (not in the running set of the DAGScheduler): 
it has failed, been canceled, or is waiting, and
   - The stage has no pending partitions (all the tasks completed at least once)
   
   For such a stage, when the finalization completes, the stage is still not 
marked as mergeFinalized. 
   The state of the stage will be: 
   - `stage.shuffleDependency.mergeFinalized = false`
   - `stage.shuffleDependency.getFinalizeTask != Nil`
   - Merged statuses of the stage are unregistered
    
   When the stage is resubmitted, the newer attempt of the stage will never 
complete even though its tasks may be completed. This is because the newer 
attempt of the stage will have `shuffleMergeEnabled = true`, since the stage 
was never marked as mergeFinalized during the previous attempt, and the 
finalizeTask is still present (from the finalization attempt for the previous 
stage attempt).
   
   So, when all the tasks of the newer attempt complete, these conditions 
will be true:
   - The stage will be running
   - There will be no pending partitions since all the tasks completed
   - `stage.shuffleDependency.shuffleMergeEnabled = true`
   - `stage.shuffleDependency.shuffleMergeFinalized = false`
   - `stage.shuffleDependency.getFinalizeTask` is not empty
   
   This leads the DAGScheduler to try scheduling finalization instead of 
triggering the completion of the stage. However, because of the last 
condition, it never actually schedules the finalization, so the stage never 
completes.
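   
   The stall can be illustrated with a small toy model. This is only a sketch 
of the decision described above, not the DAGScheduler code: the `ShuffleDep` 
and `Stage` classes, the `on_all_tasks_done` function, and the returned action 
strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class ShuffleDep:
    # Fields loosely mirror the flags listed above; toy model, not Spark code.
    shuffle_merge_enabled: bool = True
    shuffle_merge_finalized: bool = False
    finalize_task_present: bool = False


@dataclass
class Stage:
    dep: ShuffleDep
    running: bool
    pending_partitions: int


def on_all_tasks_done(stage: Stage) -> str:
    """Return the action a simplified scheduler would take (hypothetical)."""
    if not stage.running or stage.pending_partitions > 0:
        return "wait"
    dep = stage.dep
    if dep.shuffle_merge_enabled and not dep.shuffle_merge_finalized:
        # Merge still looks pending, so stage completion is deferred
        # until finalization has run.
        if dep.finalize_task_present:
            # A finalize task left over from the previous attempt means
            # no new finalization is scheduled: nothing happens.
            return "stalled"
        return "schedule_finalization"
    return "complete_stage"


# State of the re-attempt described above: running, no pending partitions,
# merge enabled, not finalized, stale finalize task present.
stuck = Stage(ShuffleDep(True, False, True), running=True, pending_partitions=0)
print(on_all_tasks_done(stuck))
```

   In this model, the same stage with `shuffle_merge_finalized` set to true 
takes the `complete_stage` path, which is why marking the stage finalized 
matters.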
   
   ### Why are the changes needed?
   The change fixes the issue above, in which the application stalls because 
some stages never complete successfully.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   I modified the existing unit test. A stage is now marked finalized 
irrespective of its state, and for a deterministic stage the merge results 
are not unregistered.
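   
   A minimal sketch of that behavior, under stated assumptions: `stage` is a 
plain dictionary with hypothetical keys, `on_merge_finalized` is an 
illustrative name, and the branch that drops merge results only for 
non-deterministic stages is inferred from the sentence above rather than 
taken from the patch.

```python
def on_merge_finalized(stage: dict) -> str:
    """Toy handler for 'merge finalization completed' (hypothetical)."""
    # Mark finalized regardless of whether the stage is running,
    # failed, canceled, or waiting.
    stage["merge_finalized"] = True
    if stage["pending_partitions"]:
        return "wait"
    if stage["running"]:
        return "complete_stage"
    if not stage["deterministic"]:
        # Assumption: only non-deterministic stages drop merge results here.
        stage["merge_results"] = []
    return "not_running"


failed_stage = {
    "merge_finalized": False,
    "pending_partitions": 0,
    "running": False,
    "deterministic": True,
    "merge_results": ["m0"],
}
print(on_merge_finalized(failed_stage), failed_stage["merge_finalized"])
```

   With the flag set even for a stage that is not running, a later re-attempt 
no longer sees the stage as merge-enabled but not finalized.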


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

