houqp commented on pull request #10917:
URL: https://github.com/apache/airflow/pull/10917#issuecomment-694567832


   @ashb this is the tradeoff we will have to make here, segfault, OOM kill 
could happen to local_scheduler_job as well, although the chance is much lower. 
Either way, we need to make sure callbacks are only invoked from within one 
process, it's either the scheduler job, or raw_task, but not from both. This is 
happening in production for us at a high frequency right now, duplicated 
failure callbacks are invoked multiple times a day, while I don't recall 
running into hard die scenarios, so I think it's a reasonable tradeoff.
   
   On top of this, the refactoring here only changes behavior of callback 
invocation triggered by external state change. It's very unlikely that external 
state are updated right before task got into OOM or segfault. With or without 
this change, Airflow still invokes failure callback from within raw_task when 
state are not changed externally, so the problem you mentioned already exists 
in today's code base.
   
   From a design's point of view, it's better to invoke success and failure 
callbacks from the task monitor process, e.g. the local scheduler job, but it 
would require a much bigger refactoring. If that's what the community prefers, 
I can give that a stab. The run task command needs to be aware of whether it's 
been invoked with or without an external task monitor, and change the callback 
invocation logic based on that.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to