The GitHub Actions job "Tests (AMD)" on airflow.git/backport-173c2a1-v3-2-test has failed. Run started by GitHub user vatsrahul1001 (triggered by vatsrahul1001).
Head commit for run:
65504bd1d4fac592bdfc1e3ddfd1d46f9ce8d957 / Jarek Potiuk <[email protected]>
Recover stuck TIs when direct terminal-state API call fails (#66574)

* Recover stuck TIs when direct terminal-state API call fails

The supervisor's _handle_request for SucceedTask, RetryTask, DeferTask,
and RescheduleTask set _terminal_state BEFORE calling the matching
client.task_instances.{succeed,retry,defer,reschedule}() API. If that
API call raised (transient network blip, server 5xx, etc.),
_terminal_state was set on the supervisor but the server never saw the
transition. The supervisor's update_task_state_if_needed then saw
final_state in STATES_SENT_DIRECTLY and short-circuited the recovery
finish() call -- leaving the TaskInstance stuck RUNNING on the server
forever, blocking downstream dependencies and triggering false alerts.

Two-part fix:

1. Make the direct API call FIRST. Only set _terminal_state and the new
   _terminal_state_synced_to_server flag after the call returns
   successfully. If the API raises, both stay unset and the exception
   propagates to handle_requests, where the existing catch-all sends an
   ErrorResponse to the task subprocess.

2. Have update_task_state_if_needed always call finish() when
   _terminal_state_synced_to_server is False, regardless of what
   final_state happens to return. The finish() API takes the state
   value, so a SUCCESS / DEFERRED / etc. transition that originally
   failed is re-attempted via finish() on subprocess exit.

Pre-existing semantics for the no-direct-API states (FAILED,
UP_FOR_RETRY without RetryTask, etc.) preserved -- those land in the
same finish() branch.

Tests added:

- _terminal_state not set when succeed() raises.
- update_task_state_if_needed calls finish() when synced flag is False,
  even with final_state == SUCCESS.
- update_task_state_if_needed skips finish() when synced flag is True
  (preserves the existing happy-path optimisation).

Reported by the L3 ASVS sweep at apache/tooling-agents#24 (FINDING-007).
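
The two-part fix described in the commit message above can be sketched roughly as follows. This is only an illustration of the ordering change, not the actual Airflow supervisor code: `FlakyClient`, `SupervisorSketch`, and `handle_succeed` are hypothetical names, and only the succeed path is shown. The attribute and method names `_terminal_state`, `_terminal_state_synced_to_server`, `update_task_state_if_needed`, `succeed`, and `finish` are taken from the commit message; their real signatures may differ.

```python
class FlakyClient:
    """Hypothetical stand-in for the task-instances API client."""

    def __init__(self, fail_succeed: bool) -> None:
        self.fail_succeed = fail_succeed
        self.calls: list[tuple] = []

    def succeed(self, ti_id: str) -> None:
        # Record the attempt, then optionally simulate a transient failure.
        self.calls.append(("succeed", ti_id))
        if self.fail_succeed:
            raise ConnectionError("simulated transient failure")

    def finish(self, ti_id: str, state: str) -> None:
        self.calls.append(("finish", ti_id, state))


class SupervisorSketch:
    """Hypothetical, heavily simplified supervisor (succeed path only)."""

    def __init__(self, client: FlakyClient, ti_id: str) -> None:
        self.client = client
        self.ti_id = ti_id
        self._terminal_state: str | None = None
        self._terminal_state_synced_to_server = False

    def handle_succeed(self) -> None:
        # Part 1: make the direct API call first. If it raises, neither
        # field below is set and the exception propagates to the caller.
        self.client.succeed(self.ti_id)
        self._terminal_state = "success"
        self._terminal_state_synced_to_server = True

    def update_task_state_if_needed(self, final_state: str) -> None:
        # Part 2: if the server never acknowledged the transition,
        # re-attempt it through finish(), whatever final_state is.
        if not self._terminal_state_synced_to_server:
            self.client.finish(self.ti_id, final_state)


if __name__ == "__main__":
    client = FlakyClient(fail_succeed=True)
    supervisor = SupervisorSketch(client, "ti-1")
    try:
        supervisor.handle_succeed()
    except ConnectionError:
        pass  # in the real code an ErrorResponse goes back to the task
    supervisor.update_task_state_if_needed("success")
    print(client.calls)
```

In this sketch a failed `succeed()` leaves `_terminal_state` unset, so the later `update_task_state_if_needed` call falls through to `finish()` instead of being skipped.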

* Refactor terminal-state dispatch and parametrize tests across all 4 states

Address review feedback on #66574:

- Extract `_send_terminal_state_msg` helper so the per-msg-type
  dispatch for succeed / retry / defer / reschedule lives in one place.
  Both `_handle_request` and `_replay_pending_terminal_state_msg` now go
  through it instead of duplicating the four-branch isinstance chain.
- Parametrize the two recovery tests over all four terminal-state
  message types (was only Succeed + Defer); add UP_FOR_RETRY and
  UP_FOR_RESCHEDULE coverage.

* Narrow _pending_terminal_state_msg type to satisfy mypy

The field was annotated as BaseModel | None, but
_send_terminal_state_msg expects SucceedTask | RetryTask | DeferTask |
RescheduleTask. mypy couldn't prove the narrowing at the
_replay_pending_terminal_state_msg call site. Tighten the field type to
the exact union the setter assigns and the consumer accepts.

---------

Co-authored-by: vatsrahul1001 <[email protected]>
Co-authored-by: Rahul Vats <[email protected]>
(cherry picked from commit 173c2a1806dd087272ec287fb923917630ef8f81)

Report URL: https://github.com/apache/airflow/actions/runs/26120400424

With regards,
GitHub Actions via GitBox

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
