DanielCarter-stack commented on issue #10568:
URL: https://github.com/apache/seatunnel/issues/10568#issuecomment-4011109255

   <!-- code-pr-reviewer -->
   Thank you for the detailed analysis and contribution. This is a valid bug in 
the master failover scenario.
   
   The root cause is confirmed in `TaskExecutionService.java:333-339`: when the 
new master restores running jobs via 
`restoreAllRunningJobFromMasterNodeSwitch()` 
(`CoordinatorService.java:449-504`), it creates new `PhysicalVertex` instances 
initialized to `CREATED` state (`PhysicalVertex.java:153-159`). The `DEPLOYING` 
branch in `SubPlan.java:660-676` checks the local `currExecutionState` field 
(via `getExecutionState()` at line 458) and sends duplicate 
`DeployTaskOperation` to workers, which already have the task running in 
`executionContexts`—causing the "TaskGroupLocation already exists" exception.
   
   Your PR #10567's approach of making `deployTask()` idempotent is correct for 
the immediate fix. Please also consider:
   1. Whether `PhysicalVertex.restoreExecutionState()` should sync the actual 
state from IMap instead of only calling `stateProcess()`
   2. Adding integration tests for the rolling restart/failover scenario in 
`TaskExecutionServiceTest.java`
   
   We'll prioritize reviewing #10567.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to