ricky2129 commented on code in PR #10567:
URL: https://github.com/apache/seatunnel/pull/10567#discussion_r2916261289


##########
seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/dag/physical/PhysicalVertex.java:
##########
@@ -206,6 +206,18 @@ public PassiveCompletableFuture<TaskExecutionState> 
initStateFuture() {
 
     public void restoreExecutionState() {
         startPhysicalVertex();
+        // Re-sync currExecutionState from IMap before processing. During 
master failover,
+        // IMap may still say DEPLOYING if the master crashed after the worker 
finished
+        // deploying but before the master processed NotifyTaskStatusOperation 
and updated
+        // the IMap to RUNNING. In that case, check whether the task is 
actually executing
+        // on the worker and advance the state to RUNNING so stateProcess() 
does not send
+        // a redundant DeployTaskOperation.
+        this.currExecutionState = (ExecutionState) 
runningJobStateIMap.get(taskGroupLocation);
+        if (ExecutionState.DEPLOYING.equals(currExecutionState)) {
+            if (checkTaskGroupIsExecuting(taskGroupLocation)) {
+                updateTaskState(ExecutionState.RUNNING);
+            }
+        }

Review Comment:
   You're right that fix 1 alone would recover the job — the redundant 
\`DeployTaskOperation\` would succeed and the 
     state machine would advance to \`RUNNING\` via the normal DEPLOYING 
branch.                                                                         
                  
                                                                                
                                                                                
           
     Fix 2 is kept for two reasons:                                             
                                                                                
           
                                                                                
                                                                                
           
     1. **Avoids unnecessary work**: Without it, master sends a full 
\`DeployTaskOperation\` (serialized task group data, network round-trip) for 
every task already
     running. With fix 2 the master detects the real state first and skips the 
deploy entirely.
   
     2. **Safer slot profile lookup**: \`deploy()\` calls 
\`jobMaster.getOwnedSlotProfiles(taskGroupLocation)\`. During restore, slot 
profiles are being reconstructed from
      IMap; if that lookup returns stale or incomplete data the deploy fails 
and the task is marked FAILING unnecessarily. Fix 2 sidesteps this path when 
not needed.
   
     That said — if you prefer to keep this PR minimal (fix 1 only), fix 2 can 
be dropped. The job will still recover correctly; it is purely a 
defense-in-depth
     optimization.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to