kaxil commented on issue #14422:
URL: https://github.com/apache/airflow/issues/14422#issuecomment-815368180


   I think this happened in 2.0.1 mainly because of the following trace:
   
   When you delete the POD, the KubernetesExecutor executes the following:
   
   1)
   
https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/executors/kubernetes_executor.py#L195-L200
   
   i.e. it tried to reschedule your POD, as is also evident from the logs in 
the issue description.
   
   2) 
   
   This then executes the following and puts the TaskInstance key on 
`result_queue`:
   
   
https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/executors/kubernetes_executor.py#L350-L359
   
   3)
   
   The TI is then marked with state `RESCHEDULE` (at least according to the 
executor events) in:
   
   
https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/executors/kubernetes_executor.py#L522-L542
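   
   To make steps 1) to 3) easier to follow, here is a rough, simplified model 
of that flow. It is not the Airflow code: the function names, the queues, the 
`"reschedule"` label and the "pod was still pending" condition are all 
assumptions made for illustration.
   
```python
import queue

RESCHEDULE = "reschedule"  # placeholder label, not Airflow's State constant


def watch_pod_event(event_type, pod_phase, key, watcher_queue):
    # Step 1 (assumed condition): a pod deleted while still pending
    # is reported as "to be rescheduled".
    if pod_phase == "Pending" and event_type == "DELETED":
        watcher_queue.put((key, RESCHEDULE))


def forward_results(watcher_queue, result_queue):
    # Step 2: move watcher messages onto the executor's result queue.
    while True:
        try:
            item = watcher_queue.get_nowait()
        except queue.Empty:
            break
        result_queue.put(item)


def sync(result_queue, event_buffer):
    # Step 3: record the reported state for each task key.
    while True:
        try:
            key, state = result_queue.get_nowait()
        except queue.Empty:
            break
        event_buffer[key] = state


def run_demo():
    watcher_queue, result_queue, event_buffer = queue.Queue(), queue.Queue(), {}
    key = ("example_dag", "example_task", 1)  # hypothetical TaskInstance key
    watch_pod_event("DELETED", "Pending", key, watcher_queue)
    forward_results(watcher_queue, result_queue)
    sync(result_queue, event_buffer)
    return event_buffer
```
   
   The point of the sketch is only that the executor's view of the task is 
derived from pod events, independently of what the task process itself does.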
   
   4)
   
   All three events above happen from the `KubernetesExecutor`'s point of 
view.
   
   At the same time, when the POD is killed (sent SIGTERM), the task process in 
the pod receives the SIGTERM. Because we override the SIGTERM handler, it 
executes the following call, which raises `AirflowException` (this matches your 
logs and stack trace):
   
   
https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/models/taskinstance.py#L1238-L1241
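   
   As a minimal sketch of that pattern (not the Airflow implementation), the 
snippet below installs a SIGTERM handler that raises an exception inside the 
running task. `TaskKilled` stands in for `AirflowException`, and the function 
names are hypothetical. It assumes a POSIX system and that it runs in the main 
thread.
   
```python
import os
import signal
import time


class TaskKilled(Exception):
    """Stand-in for AirflowException in this sketch."""


def run_task_with_sigterm_override(task):
    def on_sigterm(signum, frame):
        # Turn the signal into an exception raised in the task's own code.
        raise TaskKilled("received SIGTERM")

    previous = signal.signal(signal.SIGTERM, on_sigterm)
    try:
        task()
        return "success"
    except TaskKilled as err:
        # The exception surfaces in the normal failure-handling path.
        return f"failed: {err}"
    finally:
        # Restore whatever handler was installed before.
        signal.signal(signal.SIGTERM, previous)


def task_that_gets_killed():
    # Simulate the pod being deleted by signalling our own process.
    os.kill(os.getpid(), signal.SIGTERM)
    # Give the interpreter a chance to run the handler if it has not already.
    time.sleep(1)
```
   
   Calling `run_task_with_sigterm_override(task_that_gets_killed)` takes the 
failure branch, which is the situation described above: the task sees an 
exception rather than being terminated silently.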
   
   This `AirflowException` is handled here:
   
   
https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/models/taskinstance.py#L1142-L1150
   
   which then calls `handle_failure`, and inside it:
   
   
https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/models/taskinstance.py#L1484-L1490
   
   which does not run a failure_callback. This bug might have been introduced 
in 
https://github.com/apache/airflow/commit/efe163a1fddfd66fa402231906e96733efddf8af
, where we moved running callbacks into `LocalTaskJob`:
   
   
https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/jobs/local_task_job.py#L123-L126
   
   
https://github.com/apache/airflow/blob/beb8af5ac6c438c29e2c186145115fb1334a3735/airflow/jobs/local_task_job.py#L144-L153
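   
   To illustrate why this matters, here is a simplified model of that division 
of labour, assuming the supervising process (the role `LocalTaskJob` plays) is 
the one that runs callbacks after the task process exits. `supervise` and 
`on_failure` are hypothetical names, and this is not how Airflow implements it.
   
```python
import subprocess
import sys


def supervise(child_code, on_failure):
    # Run the "task" in a child process and wait for it to exit.
    proc = subprocess.run([sys.executable, "-c", child_code])
    # The callback runs here, in the supervisor, not in the child.
    if proc.returncode != 0:
        on_failure(proc.returncode)
    return proc.returncode
```
   
   In this model, if the supervisor never reaches the exit handling (for 
example because it is killed together with the child), the failure callback is 
never called, even though the child itself failed.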
   
   I don't see `LocalTaskJob` exit logs in your trace, so I am not sure why 
that happened.
   
   Secondly, @jedcunningham recently changed (3), where we mark the task as 
RESCHEDULED, to mark it as FAILED instead in 
https://github.com/apache/airflow/pull/14810, which means at least the logging 
around that will be taken care of. We still need to investigate why 
`LocalTaskJob` was not executed.



