thesuperzapper commented on issue #36090: URL: https://github.com/apache/airflow/issues/36090#issuecomment-2368810776
@msumit we already have a generic workaround based on my code in: https://github.com/apache/airflow/issues/36090#issuecomment-2094972855 It's not perfect, it doesn't handle the case that the triggerer is force killed (power loss) at the exact same time that the user manually changes the state of the task, but that's a very rare situation (and also applies to non-deferred `on_kill` during a worker failures). It relies on catching the `asyncio.CancelledError` exception and canceling the external job from within the triggerer itself. But only if the current state of the task (read from DB) is not deferred, to avoid canceling when the triggerer is shut down normally. The maintainers of the providers __really__ need to implement this workaround for all deferrable operators with `on_kill`, otherwise, we need a warning on each of the deferrable operator docs. Users do not expect this behavior, and leave their external jobs running accidentally. --- __Airflow 3 Proposal:__ My proposal for a unified clean-up solution is to introduce a new TaskInstance state called `pending cleanup`. When an operator defines a "cleanup handler", the task would enter this state when it terminates for any reason, including success, failure, and clearing. The "cleanup handler" is guaranteed to run __at least once__, and would be passed the following information to help it decide what to do: 1. Which state the task will go into after cleanup (success, failure, clear, up-for-retry) 2. How the state changed (if it was manually set, failed naturally, or was cleared) 3. The parameters of the task 4. Any return xcom values (if the task succeeded) 5. Runtime metadata set by the operator author using a new api (important if the task triggers an external job and gets an ID back, which could be required to cancel the external job) The question then becomes where to run the cleanup handler code. I think it makes sense to run it in the existing triggerers, as we can reuse the existing task serialization behavior, and know that it will run at least once. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
