kaxil opened a new pull request, #55084:
URL: https://github.com/apache/airflow/pull/55084

   closes #55045
   
   
   Clearing a DAG with running tasks causes them to get stuck in `RESTARTING` 
state indefinitely, with continuous heartbeat timeout events. This is a 
regression in Airflow 3.x where the scheduler fails to process executor 
completion events for `RESTARTING` tasks.
   
   ## Error Symptoms
   - Tasks remain in `RESTARTING` state after clearing
   - Continuous heartbeat timeout events: `404 Not Found` or `409 Conflict`
   - Tasks never transition to retry or completion
   - Executor reports `SUCCESS` but task state doesn't update
   
   ## Root Cause
   
   The scheduler's `_process_executor_events()` function has two gatekeeping 
mechanisms that filter which executor events get processed:
   
   1. **First filter (lines 769-775)**: Only processes events for certain states
   2. **Second filter (lines 859-864)**: Validates task instance state matches 
executor event
   
   Both filters were missing `TaskInstanceState.RESTARTING`, so the scheduler 
ignored the executor's completion events for tasks in `RESTARTING` state, even 
after the executor had successfully terminated them.
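
The effect of a missing state in such a filter can be illustrated with a minimal sketch. This is not the scheduler's actual code: the `State` enum, the `select_events` helper, and the sample keys are hypothetical stand-ins, and both checks are reduced to simple allow-lists purely to show how an omitted state silently drops an executor event.

```python
from enum import Enum


class State(str, Enum):
    """Hypothetical subset of task instance states, for illustration only."""
    QUEUED = "queued"
    RUNNING = "running"
    RESTARTING = "restarting"
    SUCCESS = "success"
    FAILED = "failed"


def select_events(events, db_states, handled_db_states):
    """Return the keys of executor events that pass both checks.

    events:            task key -> state reported by the executor
    db_states:         task key -> state currently stored in the database
    handled_db_states: database states the scheduler is willing to act on
    """
    reported_terminal = {State.SUCCESS, State.FAILED}
    selected = []
    for key, reported in events.items():
        # First check: only act on terminal reports from the executor.
        if reported not in reported_terminal:
            continue
        # Second check: the stored state must be one the scheduler expects.
        if db_states.get(key) not in handled_db_states:
            continue
        selected.append(key)
    return selected


# Two tasks were terminated by the executor; one of them had been cleared.
events = {"ti_a": State.SUCCESS, "ti_b": State.SUCCESS}
db_states = {"ti_a": State.RUNNING, "ti_b": State.RESTARTING}

before_fix = {State.QUEUED, State.RUNNING}
after_fix = before_fix | {State.RESTARTING}

print(select_events(events, db_states, before_fix))  # ti_b is dropped
print(select_events(events, db_states, after_fix))   # ti_b is processed
```

With the allow-list that omits `RESTARTING`, the event for the cleared task never reaches the code that would update its state, which matches the "stuck" symptom described above.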
   
   ### Historical Context
   
   The `RESTARTING` state was introduced in Airflow 2.2.0 (PR #16681) to handle 
cleared running tasks gracefully:
   - Prevent cleared tasks from transitioning to `FAILED` (as happened with `SHUTDOWN`)
   - Allow tasks to be retried instead of marked as failed
   - Originally relied on the "zombie detection" mechanism in Airflow 2.x
   
   In Airflow 3.x, the zombie detection mechanism was removed with the 
transition from Job-based to Task SDK architecture, but the scheduler was never 
updated to handle `RESTARTING` tasks in executor events.
   
   ### Why not immediately set state to `None`
   
   The `RESTARTING` state serves a critical purpose in the clearing workflow:
   
   1. **Race condition prevention**: When `clear_task_instances()` is called, 
it immediately:
      - Generates a new `ti.id` via `prepare_db_for_next_try()`
      - Sets state to `RESTARTING` 
      - Commits it to DB
   
   2. **Running task process continues**: The task process keeps running with 
the old `ti.id` and attempts heartbeating, which fails with `404 Not Found` 
since the database now has a new `ti.id`.
   
   3. **Executor termination**: The executor eventually kills the running 
process and reports `SUCCESS` for the terminated task.
   
   4. **Scheduler coordination**: The `RESTARTING` state signals to the 
scheduler that this task was intentionally cleared and should be retried (not 
failed) when the executor confirms termination.
   
   Setting the state directly to `None` during clearing would lose this 
coordination mechanism and could lead to race conditions where the scheduler 
processes a "completed" task before the executor has terminated the running 
process.
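
The sequence above can be walked through with a small simulation. It is a sketch under stated assumptions, not Airflow code: `TaskRow`, `clear`, `heartbeat`, and `on_executor_event` are hypothetical names, states are plain strings, and the final reset to `None` is an assumed outcome used only to show where the coordination happens.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class TaskRow:
    """Hypothetical stand-in for a task instance row in the metadata DB."""
    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: Optional[str] = "running"


def clear(row):
    """Mimic clearing a running task: assign a new id, mark it RESTARTING."""
    row.id = str(uuid.uuid4())
    row.state = "restarting"


def heartbeat(row, ti_id):
    """Return an HTTP-like status for a heartbeat sent with the given id."""
    return 200 if row.id == ti_id else 404


def on_executor_event(row, reported_state):
    """Scheduler side: once the executor confirms termination of a
    RESTARTING task, make it schedulable again (assumed here to mean
    resetting the state to None)."""
    if row.state == "restarting" and reported_state in ("success", "failed"):
        row.state = None


row = TaskRow()
old_id = row.id                    # the id the running process knows about
print(heartbeat(row, old_id))      # heartbeat accepted while running

clear(row)                         # step 1: new id, state RESTARTING
print(heartbeat(row, old_id))      # step 2: old id no longer exists

on_executor_event(row, "success")  # steps 3 and 4: termination confirmed
print(row.state)
```

In this sketch the task only becomes schedulable again after the executor has reported, which is the ordering that `RESTARTING` preserves and that a direct reset to `None` at clear time would not.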
   
   ## Future Considerations
   
   This fix resolves the immediate regression but raises design questions for 
future improvements:
   
   1. **Callbacks**: Should `on_retry_callback` be triggered for cleared tasks?
   2. **Email notifications**: Should `email_on_retry` be sent?
   
   These questions are deferred to keep this PR narrowly scoped and can be 
addressed in follow-up discussions if needed.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
