kaxil opened a new pull request, #55084:
URL: https://github.com/apache/airflow/pull/55084
closes #55045
Clearing a DAG with running tasks causes them to get stuck in `RESTARTING`
state indefinitely, with continuous heartbeat timeout events. This is a
regression in Airflow 3.x where the scheduler fails to process executor
completion events for `RESTARTING` tasks.
### Error Symptoms
- Tasks remain in `RESTARTING` state after clearing
- Continuous heartbeat timeout events: `404 Not Found` or `409 Conflict`
- Tasks never transition to retry or completion
- Executor reports `SUCCESS` but task state doesn't update
## Root Cause
The scheduler's `_process_executor_events()` function has two gatekeeping
mechanisms that filter which executor events get processed:
1. **First filter (lines 769-775)**: Only processes events for certain states
2. **Second filter (lines 859-864)**: Validates that the task instance state
matches the executor event
Both filters were missing `TaskInstanceState.RESTARTING`, so the scheduler
ignored the executor events reporting that tasks in `RESTARTING` state had
been successfully terminated.
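
The effect of the two filters can be illustrated with a minimal, self-contained sketch. It does not reproduce the scheduler code: the `State` enum, the allow-lists, and `event_is_processed()` are hypothetical names, and which value each filter inspects (the executor-reported state versus the task instance state) is an assumption made for illustration.

```python
from enum import Enum


class State(str, Enum):
    """Hypothetical stand-in for a few of Airflow's task instance states."""

    QUEUED = "queued"
    RUNNING = "running"
    RESTARTING = "restarting"
    SUCCESS = "success"
    FAILED = "failed"


# Illustrative allow-lists as they might look before the fix (assumed contents).
EVENT_STATES_BEFORE = frozenset(
    {State.QUEUED, State.RUNNING, State.SUCCESS, State.FAILED}
)
TI_STATES_BEFORE = frozenset({State.QUEUED, State.RUNNING})

# The fix amounts to adding RESTARTING to both allow-lists.
EVENT_STATES_AFTER = EVENT_STATES_BEFORE | {State.RESTARTING}
TI_STATES_AFTER = TI_STATES_BEFORE | {State.RESTARTING}


def event_is_processed(event_state, ti_state, event_states, ti_states):
    """Return True only if the event passes both gatekeeping checks."""
    if event_state not in event_states:  # first filter
        return False
    if ti_state not in ti_states:  # second filter
        return False
    return True


# A cleared task sits in RESTARTING while the executor reports SUCCESS.
ignored_before = not event_is_processed(
    State.SUCCESS, State.RESTARTING, EVENT_STATES_BEFORE, TI_STATES_BEFORE
)
handled_after = event_is_processed(
    State.SUCCESS, State.RESTARTING, EVENT_STATES_AFTER, TI_STATES_AFTER
)
```

In this model the event is dropped as long as either allow-list lacks `RESTARTING`, which is why both filters have to be updated together.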
### Historical Context
The `RESTARTING` state was introduced in Airflow 2.2.0 (PR #16681) to handle
cleared running tasks gracefully:
- Prevent cleared tasks from transitioning to `FAILED` (like `SHUTDOWN` did)
- Allow tasks to be retried instead of marked as failed
- Originally relied on the "zombie detection" mechanism in Airflow 2.x
In Airflow 3.x, the zombie detection mechanism was removed with the
transition from Job-based to Task SDK architecture, but the scheduler was never
updated to handle `RESTARTING` tasks in executor events.
### Why not immediately set state to None
The `RESTARTING` state serves a critical purpose in the clearing workflow:
1. **Race condition prevention**: When `clear_task_instances()` is called,
it immediately:
   - Generates a new `ti.id` via `prepare_db_for_next_try()`
   - Sets state to `RESTARTING`
   - Commits it to DB
2. **Running task process continues**: The task process keeps running with
the old `ti.id` and attempts heartbeating, which fails with `404 Not Found`
since the database now has a new `ti.id`.
3. **Executor termination**: The executor eventually kills the running
process and reports `SUCCESS` for the terminated task.
4. **Scheduler coordination**: The `RESTARTING` state signals to the
scheduler that this task was intentionally cleared and should be retried (not
failed) when the executor confirms termination.
Setting the state directly to `None` during clearing would lose this
coordination mechanism and could lead to race conditions where the scheduler
processes a "completed" task before the executor has terminated the running
process.
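
The four steps above can be walked through with a toy model. This is a sketch under stated assumptions, not Airflow code: `TaskInstanceRow`, `clear()`, `heartbeat_status()`, and `on_executor_termination()` are hypothetical helpers, and the `retry_requested` flag is a placeholder for whatever the scheduler actually does to make the task eligible to run again.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class TaskInstanceRow:
    """Hypothetical stand-in for the task instance row in the metadata DB."""

    id: str = field(default_factory=lambda: str(uuid.uuid4()))
    state: str = "running"
    retry_requested: bool = False  # placeholder for "should be retried"


def clear(ti):
    """Step 1: issue a new id and mark the row RESTARTING."""
    ti.id = str(uuid.uuid4())
    ti.state = "restarting"


def heartbeat_status(ti, process_ti_id):
    """Step 2: a heartbeat carrying a stale id no longer matches the row."""
    return 200 if process_ti_id == ti.id else 404


def on_executor_termination(ti):
    """Steps 3-4: the executor confirmed termination; honor RESTARTING."""
    if ti.state == "restarting":
        ti.retry_requested = True  # retried, not failed


ti = TaskInstanceRow()
old_id = ti.id  # id still held by the running task process
status_before = heartbeat_status(ti, old_id)  # matches the row
clear(ti)
status_after = heartbeat_status(ti, old_id)  # stale id after clearing
on_executor_termination(ti)
```

The point of the model is ordering: the retry decision is taken only in `on_executor_termination()`, after the old process is gone, rather than at clearing time.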
## Future Considerations
This fix resolves the immediate regression but raises design questions for
future improvements:
1. **Callbacks**: Should `on_retry_callback` be triggered for cleared tasks?
2. **Email notifications**: Should `email_on_retry` be sent?
These questions are deferred to maintain scope and can be addressed in
follow-up discussions if needed.