whynick1 opened a new issue, #49517:
URL: https://github.com/apache/airflow/issues/49517

   ### Apache Airflow version
   
   2.10.5
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
When a task pod launches successfully but the Kubernetes API server then starts returning 429 Too Many Requests errors:
   - The KubernetesJobWatcher crashes, causing the Airflow Scheduler to restart.
   - Upon restart, the Scheduler fails to re-adopt the running pod because the K8s API remains unavailable due to continued 429s.
   - As a result, the task is marked as orphaned and its state is reset to None.
   - Airflow only calls TaskInstanceHistory.record_ti() during failure handling if the task was in the running state. Since the state has been reset to None, record_ti() is never called.
   
   Consequently, there is no TaskInstanceHistory record, and the Airflow UI 
shows missing log links for that attempt.
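
   To make the failure shape concrete, here is a minimal, hypothetical sketch (not the actual KubernetesJobWatcher code; `watch_task_pods` is an illustrative name) of how an unhandled 429 from the kubernetes watch stream takes down the watcher process:

   ```python
   # Hypothetical sketch of the failure shape; not Airflow's watcher code.
   from kubernetes import client, config, watch
   from kubernetes.client.rest import ApiException

   def watch_task_pods(namespace: str) -> None:
       config.load_incluster_config()  # assumes running inside the cluster
       v1 = client.CoreV1Api()
       try:
           # stream() keeps re-issuing watch calls against the API server.
           for event in watch.Watch().stream(v1.list_namespaced_pod, namespace=namespace):
               print(event["type"], event["object"].metadata.name)
       except ApiException as err:
           # In the scenario above, err.status == 429. Re-raising it uncaught
           # kills the watcher process, which in turn restarts the Scheduler.
           raise
   ```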
   
   ### What you think should happen instead?
   
Even if a task becomes orphaned and [its state is reset to None](https://github.com/apache/airflow/blob/2.10.5/airflow/jobs/scheduler_job_runner.py#L2001), Airflow should still record a TaskInstanceHistory entry so that users have a complete log history for troubleshooting. Currently, we only [record TI history when the state is running](https://github.com/apache/airflow/blob/2.10.5/airflow/models/taskinstance.py#L3371-L3377):
   ```python
   if ti.state == TaskInstanceState.RUNNING:
       # If the task instance is in the running state, it means it raised an
       # exception and is about to retry, so we record the task instance history.
       # For other states, the task instance was cleared and already recorded in
       # the task instance history.
       from airflow.models.taskinstancehistory import TaskInstanceHistory

       TaskInstanceHistory.record_ti(ti, session=session)
   ```
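
   One possible direction, sketched below, would be to record the history row before the orphaned task's state is cleared. This is only a sketch, not a tested patch: it assumes the reset happens in the scheduler's orphaned-task handling, and `to_reset` is an illustrative name for the task instances the executor failed to re-adopt.

   ```python
   # Sketch of a possible fix, not a tested patch. `to_reset` stands in for
   # the orphaned task instances the executor could not re-adopt.
   from airflow.models.taskinstancehistory import TaskInstanceHistory

   for ti in to_reset:
       # Persist the attempt before clearing state, so the UI keeps a log
       # link even though handle_failure() never sees a RUNNING state.
       TaskInstanceHistory.record_ti(ti, session=session)
       ti.state = None  # existing reset behavior
   ```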
   
   
   ### How to reproduce
   
   Steps to trigger this behavior:
   1. Launch a task pod successfully in Airflow running with KubernetesExecutor 
or CeleryKubernetesExecutor.
2. Artificially throttle the Kubernetes API server (e.g., by applying API rate-limiting policies or load-testing the API) so that it consistently returns 429 Too Many Requests. A unit-scale simulation of this step is sketched after this list.
3. Observe that:
     - The KubernetesJobWatcher crashes.
     - The Scheduler restarts.
     - The Scheduler is unable to re-adopt the running task pod.
     - The task is marked as orphaned.
     - The TaskInstance state is reset to None.
     - No TaskInstanceHistory entry is created for the failed attempt.
     - The Airflow UI shows a missing log link for the corresponding attempt.
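
   Throttling a real API server (step 2) is cumbersome, so here is a minimal, hypothetical Python sketch (not part of Airflow's test suite; names are illustrative) that simulates it at unit scale by patching the kubernetes watch stream to raise a 429:

   ```python
   # Hypothetical simulation of step 2; not Airflow's tests.
   from unittest import mock

   from kubernetes import client, watch
   from kubernetes.client.rest import ApiException

   def run_watch(namespace: str = "default") -> None:
       v1 = client.CoreV1Api()
       for _ in watch.Watch().stream(v1.list_namespaced_pod, namespace=namespace):
           pass

   def test_watch_raises_429() -> None:
       err = ApiException(status=429, reason="Too Many Requests")
       # Patch the watch stream so every call fails with a 429, mirroring a
       # throttled API server.
       with mock.patch.object(watch.Watch, "stream", side_effect=err):
           try:
               run_watch()
           except ApiException as exc:
               assert exc.status == 429  # the unhandled 429 surfaces here
           else:
               raise AssertionError("expected a 429 ApiException")
   ```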
   
   ### Operating System
   
   Debian GNU/Linux
   
   ### Versions of Apache Airflow Providers
   
   _No response_
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

