houqp edited a comment on issue #14422:
URL: https://github.com/apache/airflow/issues/14422#issuecomment-820839115


   Here is what I think the trace is. Every pod starts with the local task job 
(monitoring/manager task), which runs the actual task code (`_run_raw_task`) in 
a subprocess. When the pod is force killed, the local task job will receive the 
SIGTERM first:
   
   
https://github.com/apache/airflow/blob/d7bc217957b65471ca5f2e259bba15c71a2b0c41/airflow/jobs/local_task_job.py#L75-L79
   
   This handler prints the first `Received SIGTERM. Terminating subprocesses` 
log output. Then in `self.on_kill`, it sends another SIGTERM to the subprocess 
that runs that actual task, and which hits the following code:
   
   
   
https://github.com/apache/airflow/blob/d7bc217957b65471ca5f2e259bba15c71a2b0c41/airflow/models/taskinstance.py#L1263-L1266
   
   This handler prints the second `Received SIGTERM. Terminating subprocesses` 
log output.
   
   
   The design in https://github.com/apache/airflow/pull/10917 is to move all 
success/failure callback invocations into local task job so we can handle 
https://github.com/apache/airflow/issues/11086. We can't rely on the actual 
task process (`_run_raw_task`) to execute the callback because that process 
itself runs arbitrary user code and can be force killed by the kernel due to 
OOM or it could crash by itself due to other bugs. The only safe way to make 
sure we execute callbacks no matter how the task subprocess exits is to execute 
them from the monitoring/manager task (local task job), which doesn't run user 
code so we can tightly control its behavior and memory foot print.
   
   I think the source of this bug is because local job task only calls 
`self.on_kill` in the SIGTERM handler, which is only responsible for killing 
the task subprocess using SIGTERM:
   
   
https://github.com/apache/airflow/blob/d7bc217957b65471ca5f2e259bba15c71a2b0c41/airflow/jobs/local_task_job.py#L155-L157
   
   Here is a potential simple fix. In local task job's `signal_handler` after 
we called `self.on_kill`, we can explicitly call `self.handle_task_exit`, which 
will check the task state and execute callbacks accordingly.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to