sw-cyderes opened a new issue, #67287:
URL: https://github.com/apache/airflow/issues/67287

   ### Under which category would you file this issue?
   
   Airflow Core
   
   ### Apache Airflow version
   
   3.2.1
   
   ### What happened and how to reproduce it?
   
   Same race condition as #66374, but the post-defer TI lands in queued state 
rather than scheduled, so the fix in #66431 / backport #67089 (merged for 
3.2.2) may not cover it. The scheduler still treats the stale executor success 
event from the worker's defer-exit as a state mismatch and kills the task 
externally.
   
   Timeline from one occurrence:
   
   [2026-05-20 13:37:14] worker     TI started (try_number=1)
   [2026-05-20 13:37:18] worker     Pausing task as DEFERRED  (defer() called, 
worker exits cleanly)
   [2026-05-20 13:37:45] scheduler  Executor LocalExecutor reported that the 
task
                                    instance <TaskInstance: .... [queued]> 
finished with
                                    state success, but the task instance's 
state attribute is queued.
                                   Task marked as up_for_retry
   [2026-05-20 13:37:??] worker     try_number=2 starts, success
   
   Notice the TI state attribute is queued (and the TI bracket label is also 
[queued]), not scheduled. The fix added by #66431 in process_executor_events 
only marks the event as ti_requeued when state == SCHEDULED and next_method is 
not None. The same logical condition (resume-after-defer with a stale executor 
success) can also leave the TI in QUEUED under load — that path falls through 
the new branch and the task is still failed/retried.
   
   We see this race fire on a subset of deferred tasks per DAG run. It is not 
reproducible deterministically — same code, same workload, sometimes races, 
sometimes doesn't.
   
   Differences and linkages with current tickets
   
   - #66374 (CLOSED): Same race, scheduled-state variant. Fixed by #66431 
(merged to main 2026-05-18, milestone 3.2.2) and backported via #67089 to 
v3-2-test. The fix added the ti_requeued branch for state == SCHEDULED with 
next_method set. This report is the analogous case for state == QUEUED, which 
the existing branch does not cover.
   - #66431 / #67089: The PRs that fix #66374. After applying these, the 
scheduled-state mismatch goes away, but in our environment the queued-state 
mismatch continues to fire on the same operator (RunMergerEc2Operator) on the 
same defer path. The fix should likely be extended to also treat state == 
QUEUED with next_method is not None as a stale defer-exit success, since the 
trigger may have rescheduled the TI into either QUEUED or SCHEDULED depending 
on which scheduler loop iteration processes the trigger event first.
   - #53797 (CLOSED, inconclusive): Earlier report of the [queued]-state 
mismatch on 3.0.3 with LocalExecutor. Closed because the original reporter 
couldn't repro on their helm deployment ("edge case for docker macOS"), but the 
comment thread has follow-up reports on Linux / GKE / 3.0.4. This appears to be 
the same family of race, still present in 3.2.x, and worth treating as a 
distinct, ongoing bug rather than a Docker-for-Mac quirk.
   - #23824 / #23846: The original 2.x-era fix for the queued-state mismatch. 
The 3.x task-SDK / API-server architecture appears to have re-opened this state 
pair on the defer path.
   
   Reproducer (probabilistic)
   
   1. Airflow 3.2.x, LocalExecutor 
   2. A deferrable operator that: Calls self.defer(trigger=..., 
method_name="execute_complete").
   3. Observe the audit/event log. A subset of TIs may hit the [queued] 
mismatch instead of scheduled.
   
   ### What you think should happen instead?
   
   The ti_requeued branch in process_executor_events should also handle the 
queued-state variant, e.g.:
   
   
   if (
       state == TaskInstanceState.SUCCESS
       and ti.next_method is not None
       and ti.state in (TaskInstanceState.SCHEDULED, TaskInstanceState.QUEUED)
   ):
       # stale defer-exit success, treat as requeue
       ...
   
   ### Operating System
   
   Airflow runs in the official apache/airflow:3.2.1 Docker image on Linux. 
Host: Ubuntu / Linux.
   
   ### Deployment
   
   Docker-Compose
   
   ### Apache Airflow Provider(s)
   
   _No response_
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Official Helm Chart version
   
   Not Applicable
   
   ### Kubernetes Version
   
   _No response_
   
   ### Helm Chart configuration
   
   _No response_
   
   ### Docker Image customizations
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to