whynick1 opened a new issue, #49517:
URL: https://github.com/apache/airflow/issues/49517

   ### Apache Airflow version
   
   2.10.5
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
When a task pod launches successfully but the Kubernetes API server then starts returning 429 Too Many Requests errors:
   - The KubernetesJobWatcher crashes, causing the Airflow Scheduler to restart.
   - Upon restart, the Scheduler fails to re-adopt the running pod because the K8s API remains unavailable due to continued 429s.
   - As a result, the task is marked as orphaned and its state is reset to None.
   - Airflow only calls TaskInstanceHistory.record_ti() during failure handling if the task was in the running state. Since the state has been reset to None, record_ti() is never called.
   
   Consequently, there is no TaskInstanceHistory record, and the Airflow UI 
shows missing log links for that attempt.
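
   To make the failure shape concrete, here is a minimal, hypothetical sketch (not the actual KubernetesJobWatcher code; `watch_task_pods` is an illustrative name) of how an unhandled 429 from the kubernetes watch stream takes down the watcher process:

   ```python
   # Hypothetical sketch of the failure shape; not Airflow's watcher code.
   from kubernetes import client, config, watch
   from kubernetes.client.rest import ApiException

   def watch_task_pods(namespace: str) -> None:
       config.load_incluster_config()  # assumes running inside the cluster
       v1 = client.CoreV1Api()
       try:
           # stream() keeps re-issuing watch calls against the API server.
           for event in watch.Watch().stream(v1.list_namespaced_pod, namespace=namespace):
               print(event["type"], event["object"].metadata.name)
       except ApiException as err:
           # In the scenario above, err.status == 429. Re-raising it uncaught
           # kills the watcher process, which in turn restarts the Scheduler.
           raise
   ```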
   
   ### What you think should happen instead?
   
Even if a task becomes orphaned and [its state is reset to None](https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L2001), Airflow should still record a TaskInstanceHistory entry so that users have a complete log history for troubleshooting. Currently, we only [record TI history when the state is running](https://github.com/apache/airflow/blob/2.10.5/airflow/models/taskinstance.py#L3371-L3377):
   ```python
   if ti.state == TaskInstanceState.RUNNING:
       # If the task instance is in the running state, it means it raised an
       # exception and is about to retry, so we record the task instance history.
       # For other states, the task instance was cleared and already recorded in
       # the task instance history.
       from airflow.models.taskinstancehistory import TaskInstanceHistory

       TaskInstanceHistory.record_ti(ti, session=session)
   ```
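
   One possible direction, sketched below, would be to record the history row before the orphaned task's state is cleared. This is only a sketch, not a tested patch: it assumes the reset happens in the scheduler's orphaned-task handling, and `to_reset` is an illustrative name for the task instances the executor failed to re-adopt.

   ```python
   # Sketch of a possible fix, not a tested patch. `to_reset` stands in for
   # the orphaned task instances the executor could not re-adopt.
   from airflow.models.taskinstancehistory import TaskInstanceHistory

   for ti in to_reset:
       # Persist the attempt before clearing state, so the UI keeps a log
       # link even though handle_failure() never sees a RUNNING state.
       TaskInstanceHistory.record_ti(ti, session=session)
       ti.state = None  # existing reset behavior
   ```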
   
   
   ### How to reproduce
   
   Steps to trigger this behavior:
   1. Launch a task pod successfully in Airflow running with KubernetesExecutor 
or CeleryKubernetesExecutor.
2. Artificially throttle the Kubernetes API server (e.g., by applying API rate-limiting policies or load-testing the API) so that it consistently returns 429 Too Many Requests. A unit-scale simulation of this step is sketched after this list.
3. Observe that:
     - The KubernetesJobWatcher crashes.
     - The Scheduler restarts.
     - The Scheduler is unable to re-adopt the running task pod.
     - The task is marked as orphaned.
     - The TaskInstance state is reset to None.
     - No TaskInstanceHistory entry is created for the failed attempt.
     - The Airflow UI shows a missing log link for the corresponding attempt.
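
   Throttling a real API server (step 2) is cumbersome, so here is a minimal, hypothetical Python sketch (not part of Airflow's test suite; names are illustrative) that simulates it at unit scale by patching the kubernetes watch stream to raise a 429:

   ```python
   # Hypothetical simulation of step 2; not Airflow's tests.
   from unittest import mock

   from kubernetes import client, watch
   from kubernetes.client.rest import ApiException

   def run_watch(namespace: str = "default") -> None:
       v1 = client.CoreV1Api()
       for _ in watch.Watch().stream(v1.list_namespaced_pod, namespace=namespace):
           pass

   def test_watch_raises_429() -> None:
       err = ApiException(status=429, reason="Too Many Requests")
       # Patch the watch stream so every call fails with a 429, mirroring a
       # throttled API server.
       with mock.patch.object(watch.Watch, "stream", side_effect=err):
           try:
               run_watch()
           except ApiException as exc:
               assert exc.status == 429  # the unhandled 429 surfaces here
           else:
               raise AssertionError("expected a 429 ApiException")
   ```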
   
   ### Operating System
   
   Debian GNU/Linux
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

