yeachan153 opened a new issue #21900: URL: https://github.com/apache/airflow/issues/21900
### Description

The `kubernetes_pod_operator` currently has a `reattach_on_restart` parameter that attempts to reattach to running pods instead of creating a new pod in case a scheduler dies while the task is running. We would like this feature to also work when the worker dies.

Currently, a dying worker receives a SIGTERM and triggers the `on_kill` method:
https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/models/taskinstance.py#L1425

This ends up deleting the pod that was created:
https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py#L438

We have worked around this problem by removing the `on_kill` call upon receiving a SIGTERM and pushing an xcom indicating that the worker was killed. We then enabled retries for the `kubernetes_pod_operator` and modified the [is_eligible_to_retry](https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/models/taskinstance.py#L1825) function to check for the presence of this xcom and only retry if found, allowing us to retry only when the worker was killed.

Unfortunately, this is not a perfect solution because clearing a task / stopping a task via the UI triggers the same signal handler as when a worker is killed externally. Therefore, stopping the task now does not kill the pod, and clearing the task causes a reattach when we would ideally like a restart.

### Use case/motivation

Since the pod itself may fail for a valid reason, we don't just want to add more retries. In that situation, the retry would also not reattach but start a completely new pod, since the original pod would have been cleaned up. We specifically want the reattaching to happen when the worker dies for infrastructure-related reasons. This is useful, for instance, during deployment updates in Kubernetes.
It's currently quite a disruptive process because all the running pods are first killed, and if retries are not enabled (for the reasons mentioned above), we have to restart all of them again (and potentially lose all the progress on any expensive operations that were running pre-deployment).

### Related issues

_No response_

### Are you willing to submit a PR?

- [ ] Yes I am willing to submit a PR!

### Code of Conduct

- [X] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
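For reference, the workaround described under Description can be sketched roughly as follows. This is a standalone illustration of the logic only, not Airflow's code: `FakeTaskInstance`, `WORKER_KILLED_KEY`, and `handle_sigterm` are hypothetical names, the xcom store is a plain dict, and `is_eligible_to_retry` here is a free function that only mimics the idea of the patched check.

```python
from dataclasses import dataclass, field

# Hypothetical xcom key used to mark "the worker received a SIGTERM".
WORKER_KILLED_KEY = "worker_killed"


@dataclass
class FakeTaskInstance:
    """Stand-in for a task instance (assumption: not Airflow's TaskInstance)."""

    retries_left: int = 1
    xcom: dict = field(default_factory=dict)
    pod_deleted: bool = False


def handle_sigterm(ti: FakeTaskInstance) -> None:
    """Patched SIGTERM handling: record the kill instead of calling on_kill.

    The stock behaviour calls on_kill, which deletes the pod. Here the pod
    is left running so that a retry can reattach to it.
    """
    ti.xcom[WORKER_KILLED_KEY] = True
    # Note: the handler has no way to tell why the signal arrived, so a
    # stop or clear from the UI sets the same marker as a dying worker.


def is_eligible_to_retry(ti: FakeTaskInstance) -> bool:
    """Retry only when retries remain and the worker-killed marker exists."""
    return ti.retries_left > 0 and bool(ti.xcom.get(WORKER_KILLED_KEY))


if __name__ == "__main__":
    ti = FakeTaskInstance(retries_left=1)
    print(is_eligible_to_retry(ti))  # pod failed on its own: no marker yet
    handle_sigterm(ti)
    print(is_eligible_to_retry(ti), ti.pod_deleted)
```

The comment in `handle_sigterm` is the limitation mentioned above: because the marker is set for every SIGTERM, a task stopped or cleared via the UI looks identical to a task whose worker died.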
