hterik opened a new issue, #37548:
URL: https://github.com/apache/airflow/issues/37548
### Description
We occasionally see KubernetesExecutor tasks getting lost in cyberspace, with
no logs in the Airflow UI describing why.
If admins look into the scheduler logs (Airflow 2.7.1), the following can
be seen:
```
15:55:58 {scheduler_job_runner.py:636} INFO - Sending
TaskInstanceKey(dag_id='G', task_id='S', run_id='RRR', try_number=1,
map_index=-1) to executor with priority 31 and queue kubernetes
15:55:58 {base_executor.py:144} INFO - Adding to queue: ['airflow', 'tasks',
'run', 'G', 'S', 'RRR', '--local', '--subdir', 'DAGS_FOLDER/dag_G.py.py']
15:55:58 {kubernetes_executor.py:319} INFO - Add task
TaskInstanceKey(dag_id='G', task_id='S', run_id='RRR', try_number=1,
map_index=-1) with command ['airflow', 'tasks', 'run', 'G', 'S', 'RRR',
'--local', '--subdir', 'DAGS_FOLDER/dag_G.py.py']
15:55:58 {kubernetes_executor_utils.py:395} INFO - Creating kubernetes pod
for job is TaskInstanceKey(dag_id='G', task_id='S', run_id='RRR', try_number=1,
map_index=-1), with pod name PPP, annotations: <omitted>
15:55:58 {scheduler_job_runner.py:686} INFO - Received executor event with
state queued for task instance TaskInstanceKey(dag_id='G', task_id='S',
run_id='RRR', try_number=1, map_index=-1)
15:55:58 {scheduler_job_runner.py:713} INFO - Setting external_id for
<TaskInstance: G.S RRR [queued]> to 800028
15:56:20 <<<OUTSIDE AIRFLOW>>>: Kubernetes Eviction event: "The node was
low on resource: memory. Threshold quantity: .....
15:56:23 {kubernetes_executor.py:363} INFO - Changing state of
(TaskInstanceKey(dag_id='G', task_id='S', run_id='RRR', try_number=1,
map_index=-1), <TaskInstanceState.FAILED: 'failed'>, 'PPP', 'default',
'311004175') to failed
15:56:23 {kubernetes_executor.py:455} INFO - Patched pod
TaskInstanceKey(dag_id='G', task_id='S', run_id='RRR', try_number=1,
map_index=-1) in namespace default to mark it as done
15:56:23 {scheduler_job_runner.py:686} INFO - Received executor event with
state failed for task instance TaskInstanceKey(dag_id='G', task_id='S',
run_id='RRR', try_number=1, map_index=-1)
15:56:23 {scheduler_job_runner.py:723} INFO - TaskInstance Finished:
dag_id=G, task_id=S, run_id=RRR, map_index=-1, run_start_date=None,
run_end_date=None, run_duration=None, state=queued, executor_state=failed,
try_number=1, max_tries=0, job_id=None, pool=default_pool, queue=kubernetes,
priority_weight=31, operator=BranchPythonOperator, queued_dttm=2024-02-19
14:55:58.808536+00:00, queued_by_job_id=800028, pid=None
15:56:23 {scheduler_job_runner.py:771} ERROR - Executor reports task
instance <TaskInstance: G.S RRR [queued]> finished (failed) although the task
says it's queued. (Info: None) Was the task killed externally?
15:56:23 {taskinstance.py:1937} ERROR - Executor reports task instance
<TaskInstance: G.S RRR [queued]> finished (failed) although the task says it's
queued. (Info: None) Was the task killed externally?
15:56:25 {kubernetes_executor.py:363} INFO - Changing state of
(TaskInstanceKey(dag_id='G', task_id='S', run_id='RRR', try_number=1,
map_index=-1), <TaskInstanceState.FAILED: 'failed'>, 'PPP', 'default',
'311004178') to failed
15:56:25 {kubernetes_executor.py:455} INFO - Patched pod
TaskInstanceKey(dag_id='G', task_id='S', run_id='RRR', try_number=1,
map_index=-1) in namespace default to mark it as done
15:56:25 {kubernetes_executor.py:363} INFO - Changing state of
(TaskInstanceKey(dag_id='G', task_id='S', run_id='RRR', try_number=1,
map_index=-1), <TaskInstanceState.FAILED: 'failed'>, 'PPP', 'default',
'311004180') to failed
15:56:25 {kubernetes_executor.py:455} INFO - Patched pod
TaskInstanceKey(dag_id='G', task_id='S', run_id='RRR', try_number=1,
map_index=-1) in namespace default to mark it as done
15:56:25 {scheduler_job_runner.py:686} INFO - Received executor event with
state failed for task instance TaskInstanceKey(dag_id='G', task_id='S',
run_id='RRR', try_number=1, map_index=-1)
15:56:25 {scheduler_job_runner.py:723} INFO - TaskInstance Finished:
dag_id=G, task_id=S, run_id=RRR, map_index=-1, run_start_date=None,
run_end_date=2024-02-19 14:56:23.933950+00:00, run_duration=None, state=failed,
executor_state=failed, try_number=1, max_tries=0, job_id=None,
pool=default_pool, queue=kubernetes, priority_weight=31,
operator=BranchPythonOperator, queued_dttm=2024-02-19 14:55:58.808536+00:00,
queued_by_job_id=800028, pid=None
```
It would be a lot easier to debug such issues if:

A) The scheduler logs mentioned the pod failure `Reason=Evicted` and
`status=Failed`. These can be found on the `V1Pod` object returned by the
Kubernetes API (see the sketch below).
B) The Airflow UI surfaced this error somewhere, instead of showing
nothing at all.
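
For illustration, the information asked for in A) is already present on the
pod status that the executor's watcher receives. A minimal sketch, assuming
the `kubernetes` Python client the executor already depends on; the helper
name `describe_pod_failure` is hypothetical, not an existing Airflow function:

```python
# Minimal sketch: summarize a failed pod from its V1PodStatus fields.
# `describe_pod_failure` is a hypothetical helper, not Airflow code.
from kubernetes.client.models import V1Pod


def describe_pod_failure(pod: V1Pod) -> str:
    """Summarize why a pod failed, e.g.
    'Failed: Evicted: The node was low on resource: memory. ...'"""
    status = pod.status
    parts = [status.phase or "Unknown"]  # e.g. "Failed"
    if status.reason:                    # e.g. "Evicted"
        parts.append(status.reason)
    if status.message:                   # kubelet's human-readable detail
        parts.append(status.message)
    return ": ".join(parts)
```

For evicted pods, `status.phase` is `Failed`, `status.reason` is `Evicted`,
and `status.message` carries the kubelet text seen above ("The node was low
on resource: memory. ..."). If something like this were attached to the
executor event, the `(Info: None)` in the scheduler error above could carry
the actual reason instead.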
### Use case/motivation
_No response_
### Related issues
_No response_
### Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
### Code of Conduct
- [X] I agree to follow this project's [Code of
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)