Vamsi-klu opened a new pull request, #64023:
URL: https://github.com/apache/airflow/pull/64023

   ## Summary
   
   Fixes a race condition where a successfully completed task gets incorrectly 
marked as FAILED due to network retry behavior.
   
   **Root cause:** When a task completes, the supervisor sends a SUCCESS state update to the API server. If the HTTP response is lost (for example, on a network timeout), httpx retries the request. By then the API server sees the task already in SUCCESS (not RUNNING) and returns 409 Conflict. The error then cascades: the supervisor sends an error to the task process, the task exits with code 1, the supervisor interprets exit code 1 as FAILED, and it attempts to overwrite the correct SUCCESS state.
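
The sequence described above can be reproduced with a toy model. This is a minimal sketch, not Airflow code: `ToyServer`, `send_with_retry`, and the `lose_first_response` flag are hypothetical names used only to show why a strict "must be RUNNING" check turns a retried request into a 409.

```python
# Toy model of the race; not Airflow code. All names here are hypothetical.
RUNNING, SUCCESS = "running", "success"


class ToyServer:
    """Accepts a state update only while the task is RUNNING."""

    def __init__(self):
        self.state = RUNNING

    def update_state(self, requested):
        if self.state != RUNNING:
            return 409  # strict check: any update to a finished task conflicts
        self.state = requested
        return 204


def send_with_retry(server, requested, lose_first_response):
    """Send the update; if the first response is lost, send it again."""
    status = server.update_state(requested)
    if lose_first_response:
        # The server applied the update, but the client never saw the 204,
        # so the client repeats the same request.
        status = server.update_state(requested)
    return status


server = ToyServer()
status = send_with_retry(server, SUCCESS, lose_first_response=True)
print(server.state, status)  # success 409
```

The stored state is already correct after the first request; only the response to the retry is wrong, and that response is what starts the cascade.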
   
   **Fix:** Idempotent terminal state handling at three layers:
   
   - **API Server** (`task_instances.py`): Return 204 no-op when a terminal 
state update matches the current state (e.g., SUCCESS→SUCCESS). Cross-state 
conflicts (SUCCESS→FAILED) still return 409.
   - **Supervisor `_handle_request`** (`supervisor.py`): Catch 409 on 
`SucceedTask` and `RetryTask` when the task is already in the target state, 
preventing error propagation to the task process.
   - **Supervisor `update_task_state_if_needed`** (`supervisor.py`): Handle 409 
during post-exit state reporting to prevent unhandled exceptions from `wait()`.
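
The rule shared by the three layers above can be sketched as follows. This is an illustration under stated assumptions, not the code in this PR: `resolve_update` and `should_swallow_conflict` are hypothetical helpers, the state names are plain strings, and the supervisor is assumed to already know the task's current state.

```python
# Hypothetical sketch of the idempotency rule; not the PR's implementation.
TERMINAL_STATES = {"success", "failed", "skipped"}


def resolve_update(previous_state, requested_state):
    """Server side: choose the HTTP status for a terminal state update."""
    if previous_state == "running":
        return 204  # normal transition out of RUNNING
    if previous_state in TERMINAL_STATES and previous_state == requested_state:
        return 204  # same terminal state requested again: idempotent no-op
    return 409  # cross-state conflict, e.g. SUCCESS -> FAILED


def should_swallow_conflict(status_code, current_state, target_state):
    """Supervisor side: ignore a 409 only if the task is already in the target state."""
    return status_code == 409 and current_state == target_state


print(resolve_update("success", "success"))  # 204
print(resolve_update("success", "failed"))   # 409
print(should_swallow_conflict(409, "success", "success"))  # True
```

Keeping the cross-state case at 409 means a genuine conflict is still reported; only an exact repeat of an already applied update is treated as a no-op.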
   
   ## Files Changed
   
   | File | Change |
   |------|--------|
   | `airflow-core/src/airflow/api_fastapi/execution_api/routes/task_instances.py` | Added `_get_requested_state()` helper + idempotency check in `ti_update_state()` |
   | `task-sdk/src/airflow/sdk/execution_time/supervisor.py` | Added `_is_already_in_target_state()` helper + 409 handling in `SucceedTask`/`RetryTask` handlers + 409 handling in `update_task_state_if_needed()` |
   | `airflow-core/tests/unit/api_fastapi/execution_api/versions/head/test_task_instances.py` | 7 new tests: idempotent success/failed/skipped + 4 parametrized cross-state conflict tests |
   | `task-sdk/tests/task_sdk/execution_time/test_supervisor.py` | 8 new tests: idempotent 409 for succeed/retry, different-state 409 propagation, `update_task_state_if_needed` conflict handling, `_is_already_in_target_state` unit tests |
   
   ## Impact
   
   - **Low risk**: The API change adds an early-return path only when 
`previous_state == requested_state` — no existing behavior is altered for valid 
or genuinely conflicting transitions.
   - **Defense-in-depth**: The supervisor-side handling is a secondary safety 
net. Even if the API fix regresses, the supervisor gracefully handles the 
idempotent 409 case.
   
   ## Test Plan
   
   - [x] 7 new API server tests (idempotent + cross-state conflict)
   - [x] 8 new supervisor tests (409 handling + helper unit tests)
   - [x] Full `TestTIUpdateState` class: 41 passed, 1 skipped
   - [ ] `test_ti_update_state_reschedule_mysql_limit` — skipped because it 
requires a MySQL backend (`BACKEND` env var not set in local test environment). 
This test is unrelated to the idempotency changes and will pass in CI with the 
MySQL backend configured.
   
   closes: #63183
   
   ---
   
   ##### Was generative AI tooling used to co-author this PR?
   
   - [X] Yes — Claude Code (claude-opus-4-6)
   
   Generated-by: Claude Code (claude-opus-4-6) following [the guidelines](https://github.com/apache/airflow/blob/main/contributing-docs/05_pull_requests.rst#gen-ai-assisted-contributions)

