ricky2129 commented on code in PR #10567:
URL: https://github.com/apache/seatunnel/pull/10567#discussion_r2916261289
##########
seatunnel-engine/seatunnel-engine-server/src/main/java/org/apache/seatunnel/engine/server/dag/physical/PhysicalVertex.java:
##########
@@ -206,6 +206,18 @@ public PassiveCompletableFuture<TaskExecutionState>
initStateFuture() {
public void restoreExecutionState() {
startPhysicalVertex();
+ // Re-sync currExecutionState from IMap before processing. During
master failover,
+ // IMap may still say DEPLOYING if the master crashed after the worker
finished
+ // deploying but before the master processed NotifyTaskStatusOperation
and updated
+ // the IMap to RUNNING. In that case, check whether the task is
actually executing
+ // on the worker and advance the state to RUNNING so stateProcess()
does not send
+ // a redundant DeployTaskOperation.
+ this.currExecutionState = (ExecutionState)
runningJobStateIMap.get(taskGroupLocation);
+ if (ExecutionState.DEPLOYING.equals(currExecutionState)) {
+ if (checkTaskGroupIsExecuting(taskGroupLocation)) {
+ updateTaskState(ExecutionState.RUNNING);
+ }
+ }
Review Comment:
You're right that fix 1 alone would recover the job — the redundant
\`DeployTaskOperation\` would succeed and the
state machine would advance to \`RUNNING\` via the normal DEPLOYING
branch.
Fix 2 is kept for two reasons:
1. **Avoids unnecessary work**: Without it, master sends a full
\`DeployTaskOperation\` (serialized task group data, network round-trip) for
every task already
running. With fix 2 the master detects the real state first and skips the
deploy entirely.
2. **Safer slot profile lookup**: \`deploy()\` calls
\`jobMaster.getOwnedSlotProfiles(taskGroupLocation)\`. During restore, slot
profiles are being reconstructed from
IMap; if that lookup returns stale or incomplete data the deploy fails
and the task is marked FAILING unnecessarily. Fix 2 sidesteps this path when
not needed.
That said — if you prefer to keep this PR minimal (fix 1 only), fix 2 can
be dropped. The job will still recover correctly; it is purely a
defense-in-depth
optimization.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]