Re: [I] Deferrable operator tasks do not call `on_kill` method when fail or restarted [airflow]

via GitHub Mon, 23 Sep 2024 09:40:44 -0700


thesuperzapper commented on issue #36090:
URL: https://github.com/apache/airflow/issues/36090#issuecomment-2368810776


   @msumit we already have a generic workaround based on my code in: 
https://github.com/apache/airflow/issues/36090#issuecomment-2094972855
   
   It's not perfect, it doesn't handle the case that the triggerer is force 
killed (power loss) at the exact same time that the user manually changes the 
state of the task, but that's a very rare situation (and also applies to 
non-deferred `on_kill` during a worker failures).
   
   It relies on catching the `asyncio.CancelledError` exception and canceling 
the external job from within the triggerer itself. But only if the current 
state of the task (read from DB) is not deferred, to avoid canceling when the 
triggerer is shut down normally.
   
   The maintainers of the providers __really__ need to implement this 
workaround for all deferrable operators with `on_kill`, otherwise, we need a 
warning on each of the deferrable operator docs.  Users do not expect this 
behavior, and leave their external jobs running accidentally.
   
   ---
   
   __Airflow 3 Proposal:__
   
   My proposal for a unified clean-up solution is to introduce a new 
TaskInstance state called `pending cleanup`.
   
   When an operator defines a "cleanup handler", the task would enter this 
state when it terminates for any reason, including success, failure, and 
clearing.
   
   The "cleanup handler" is guaranteed to run __at least once__, and would be 
passed the following information to help it decide what to do:
   
   1. Which state the task will go into after cleanup (success, failure, clear, 
up-for-retry)
   2. How the state changed (if it was manually set, failed naturally, or was 
cleared)
   3. The parameters of the task
   4. Any return xcom values (if the task succeeded)
   5. Runtime metadata set by the operator author using a new api (important if 
the task triggers an external job and gets an ID back, which could be required 
to cancel the external job)
   
   The question then becomes where to run the cleanup handler code. I think it 
makes sense to run it in the existing triggerers, as we can reuse the existing 
task serialization behavior, and know that it will run at least once.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [I] Deferrable operator tasks do not call `on_kill` method when fail or restarted [airflow]

Reply via email to