sidshas03 commented on PR #63355:
URL: https://github.com/apache/airflow/pull/63355#issuecomment-4042168240

   > > 409 invalid_state is coming when same TI gets terminal `SUCCESS` update 
more than once. In this case, by the time task runner sends final SUCCESS, DB 
already has SUCCESS (`previous_state=success`), so API rejects the second write.
   > 
   > Yeah, that's what I meant, i.e. let's find out the root cause first and 
not add band-aid -- what's causing task to succeed before the worker reports in 
the first place.
   
   I traced this deeper and found the root-cause path.
   
   This is primarily a scheduler-side HA race, not only a task-runner 
finalization issue.
   
   In `DagRun.schedule_tis()` (`airflow-core/src/airflow/models/dagrun.py`), 
the scheduling update is keyed by TI id and can run from a stale scheduler 
view, which allows duplicate scheduling / try bump for the same TI under 
contention.  
   That means the same TI can be enqueued as try 1 and try 2.
   
   Then one attempt finishes first and sets TI to `success`.  
   When the other attempt later reports terminal state, execution API correctly 
rejects it with `409 invalid_state` / `previous_state=success` 
(`airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py`).
   
   This matches the observed ordering in related HA logs:
   - enqueue try 1
   - enqueue try 2 for same TI
   - try 2 runs
   - DagRun marked success
   - late try 1 report hits invalid_state conflict
   
   So the band-aid behavior in this PR avoids task-process failure for 
duplicate-success report, but the underlying cause is the scheduler HA race on 
TI scheduling/ownership.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to