sidshas03 commented on PR #63355:
URL: https://github.com/apache/airflow/pull/63355#issuecomment-4048895730

   > > > > 409 invalid_state is coming when same TI gets terminal `SUCCESS` 
update more than once. In this case, by the time task runner sends final 
SUCCESS, DB already has SUCCESS (`previous_state=success`), so API rejects the 
second write.
   > > > 
   > > > 
   > > > Yeah, that's what I meant, i.e. let's find out the root cause first 
and not add band-aid -- what's causing task to succeed before the worker 
reports in the first place.
   > > 
   > > 
   > > I traced this deeper and found the root-cause path.
   > > This is primarily a scheduler-side HA race, not only a task-runner 
finalization issue.
   > > In `DagRun.schedule_tis()` 
(`airflow-core/src/airflow/models/dagrun.py`), the scheduling update is keyed 
by TI id and can run from a stale scheduler view, which allows duplicate 
scheduling / try bump for the same TI under contention. That means the same TI 
can be enqueued as try 1 and try 2.
   > > Then one attempt finishes first and sets TI to `success`. When the other 
attempt later reports terminal state, execution API correctly rejects it with 
`409 invalid_state` / `previous_state=success` 
(`airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py`).
   > 
   > If you say that the issue is due to a stale view, maybe this can be 
solved by an exclusive lock? Or, if the stale view comes from the ORM cache, 
maybe we can clear the cache before running the query?
   > 
   > As this probably should also be part of the critical section of the main 
scheduler loop, what do you think?
   
   
   
   Thanks @Nataneljpwd, very valid point.
   
   I dug into it and agree the root cause is a scheduler-side HA race (stale 
scheduler view / competing updates), not only task-runner finalization.
   
   This PR is intentionally narrow: it makes the TI state endpoint idempotent 
for duplicate same-state terminal updates (it returns 200), so a duplicate 
SUCCESS report doesn’t fail the task process. Genuinely conflicting 
transitions still return 409.
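   The idempotency rule above can be sketched like this. This is a minimal 
illustration of the decision logic, not the actual Airflow execution API code; 
the function name, the return shape, and the state set are all hypothetical.

```python
# Hypothetical sketch of the idempotent terminal-state check described above.
# Names and return shapes are illustrative, not the real Airflow endpoint code.
TERMINAL_STATES = {"success", "failed"}

def handle_ti_state_update(current_state: str, new_state: str) -> tuple[int, str]:
    """Return an (http_status, reason) pair for a TI state update."""
    if current_state in TERMINAL_STATES:
        if new_state == current_state:
            # Duplicate report of the same terminal state: accept as a no-op.
            return 200, "already_in_state"
        # A genuinely conflicting transition is still rejected.
        return 409, f"invalid_state: previous_state={current_state}"
    return 200, "updated"
```

   So a second `SUCCESS` report against a TI already in `success` gets a 200 
no-op, while `SUCCESS` → `FAILED` (or vice versa) still surfaces the 409.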
   
   On the lock/cache suggestions:
   - clearing the ORM cache won’t reliably solve it, because the race is 
across scheduler processes, not within a single session’s cache.
   - a global/exclusive lock in the scheduler critical section would likely 
hurt HA throughput and increase contention.
   
   The durable fix belongs in the scheduler write path (`DagRun.schedule_tis`) 
with state-guarded conditional updates, so stale updates become no-ops. I’ve 
handled that separately in #63367 with race-focused tests.
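   The shape of a state-guarded conditional update can be sketched with a 
compare-and-set style `UPDATE ... WHERE state = ?`. This uses a throwaway 
sqlite3 table purely for illustration; the real fix would go through the ORM 
in `DagRun.schedule_tis()`, and the table/column names here are simplified 
assumptions.

```python
# Illustrative compare-and-set scheduling write, sketched with sqlite3.
# Table layout and function name are hypothetical simplifications.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE task_instance (id TEXT PRIMARY KEY, state TEXT, try_number INT)"
)
conn.execute("INSERT INTO task_instance VALUES ('ti1', 'scheduled', 1)")

def schedule_ti(conn: sqlite3.Connection, ti_id: str, expected_state: str) -> bool:
    # Guard the write on the state this scheduler believes the TI is in;
    # a stale view makes the UPDATE match zero rows instead of bumping try
    # a second time.
    cur = conn.execute(
        "UPDATE task_instance "
        "SET state = 'queued', try_number = try_number + 1 "
        "WHERE id = ? AND state = ?",
        (ti_id, expected_state),
    )
    return cur.rowcount == 1

first = schedule_ti(conn, "ti1", "scheduled")   # wins the race: row updated
second = schedule_ti(conn, "ti1", "scheduled")  # stale view: matches nothing
```

   With this guard, two schedulers racing on the same TI can’t both enqueue 
it: only the first conditional update succeeds, and the loser’s write is a 
no-op rather than a duplicate try bump.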
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
