houqp commented on pull request #10917: URL: https://github.com/apache/airflow/pull/10917#issuecomment-694567832
@ashb this is the tradeoff we will have to make here, segfault, OOM kill could happen to local_scheduler_job as well, although the chance is much lower. Either way, we need to make sure callbacks are only invoked from within one process, it's either the scheduler job, or raw_task, but not from both. This is happening in production for us at a high frequency right now, duplicated failure callbacks are invoked multiple times a day, while I don't recall running into hard die scenarios, so I think it's a reasonable tradeoff. On top of this, the refactoring here only changes behavior of callback invocation triggered by external state change. It's very unlikely that external state are updated right before task got into OOM or segfault. With or without this change, Airflow still invokes failure callback from within raw_task when state are not changed externally, so the problem you mentioned already exists in today's code base. From a design's point of view, it's better to invoke success and failure callbacks from the task monitor process, e.g. the local scheduler job, but it would require a much bigger refactoring. If that's what the community prefers, I can give that a stab. The run task command needs to be aware of whether it's been invoked with or without an external task monitor, and change the callback invocation logic based on that. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected]
