whynick1 opened a new issue, #49517: URL: https://github.com/apache/airflow/issues/49517
### Apache Airflow version

2.10.5

### If "Other Airflow 2 version" selected, which one?

_No response_

### What happened?

When a task pod launches successfully but the Kubernetes API server starts returning `429 Too Many Requests` errors:

- KubernetesJobWatcher crashes, causing the Airflow Scheduler to restart.
- Upon restart, the Scheduler fails to re-adopt the running pod because the K8s API remains unavailable due to continued 429s.
- As a result, the task is marked orphaned and its state is reset to `None`.
- Airflow only calls `TaskInstanceHistory.record_ti()` during failure handling if the task was in a running state. Since the state is now reset to `None`, `record_ti()` is never called.

Consequently, there is no TaskInstanceHistory record, and the Airflow UI shows missing log links for that attempt.

### What you think should happen instead?

Even if a task becomes orphaned and [its state is reset to None](https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L2001), Airflow should still record a TaskInstanceHistory entry to maintain a complete log history for user troubleshooting. Currently we only [record TI history when the state is running](https://github.com/apache/airflow/blob/2.10.5/airflow/models/taskinstance.py#L3371-L3377):

```python
if ti.state == TaskInstanceState.RUNNING:
    # If the task instance is in the running state, it means it raised an exception and
    # about to retry so we record the task instance history. For other states, the task
    # instance was cleared and already recorded in the task instance history.
    from airflow.models.taskinstancehistory import TaskInstanceHistory

    TaskInstanceHistory.record_ti(ti, session=session)
```

### How to reproduce

Steps to trigger this behavior:

1. Launch a task pod successfully in Airflow running with KubernetesExecutor or CeleryKubernetesExecutor.
2. Artificially throttle the Kubernetes API server (e.g., by applying API rate-limiting policies or load-testing the API) so that it consistently returns `429 Too Many Requests`.
3. Observe that:
   - KubernetesJobWatcher crashes.
   - The Scheduler restarts.
   - The Scheduler is unable to re-adopt the running task pod.
   - The task is marked as orphaned.
   - The TaskInstance state is reset to `None`.
   - No TaskInstanceHistory entry is created for the failed attempt.
   - The Airflow UI shows a missing log link for the corresponding attempt.

### Operating System

Debian GNU/Linux

### Versions of Apache Airflow Providers

_No response_

### Deployment

Official Apache Airflow Helm Chart

### Deployment details

_No response_

### Anything else?

_No response_

### Are you willing to submit PR?

- [x] Yes I am willing to submit a PR!

### Code of Conduct

- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
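To make the proposed behavior change concrete, here is a minimal, self-contained sketch of the decision logic. The helper name `should_record_ti_history` and the `was_orphaned` flag are hypothetical illustrations, not part of Airflow's API; they stand in for the `state == RUNNING` check that currently gates `TaskInstanceHistory.record_ti()`.

```python
# Stand-in for TaskInstanceState.RUNNING; this sketch avoids importing Airflow.
RUNNING = "running"


def should_record_ti_history(state, was_orphaned=False):
    """Return True if a TaskInstanceHistory row should be written.

    Current behavior: record only when the task is still RUNNING.
    Proposed behavior: also record when the state was reset to None
    because the scheduler orphaned the task (e.g., after failing to
    re-adopt its pod during sustained 429s from the K8s API).
    """
    if state == RUNNING:
        return True
    # Proposed addition: an orphaned task whose state was reset to None
    # would otherwise leave no history row and no log link in the UI.
    return state is None and was_orphaned
```

In Airflow itself, this would amount to also calling `TaskInstanceHistory.record_ti()` on the orphaned-task reset path in the scheduler, rather than only inside the `state == RUNNING` branch of failure handling.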
