yeachan153 opened a new issue #21900:
URL: https://github.com/apache/airflow/issues/21900


   ### Description
   
   The `kubernetes_pod_operator` currently has a `reattach_on_restart` 
parameter that attempts to reattach to running pods instead of creating a new 
pod in case a scheduler dies while the task is running.
   
   We would like this feature to also work when the worker dies. 
Currently, a dying worker receives a SIGTERM, which triggers the `on_kill` method:
   
https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/models/taskinstance.py#L1425
   
   This ends up deleting the pod that was created:
   
https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/providers/cncf/kubernetes/operators/kubernetes_pod.py#L438
   
   We currently work around this problem by removing the `on_kill` call upon 
receiving a SIGTERM and pushing an XCom indicating that the worker was killed. 
We then enable retries for the `kubernetes_pod_operator` and modify the 
[is_eligible_to_retry](https://github.com/apache/airflow/blob/ace8c6e942ff5554639801468b971915b7c0e9b9/airflow/models/taskinstance.py#L1825)
 function to check for the presence of this XCom and retry only if it is found, 
so that a retry happens only when the worker was killed.
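   The workaround above can be sketched roughly as follows. This is an illustrative outline, not the actual patch: the `WORKER_KILLED_XCOM_KEY` name, the `make_sigterm_handler` helper, and the simplified `is_eligible_to_retry` logic are all hypothetical, and the fake-free parts rely only on the standard `xcom_push`/`xcom_pull` task-instance methods:

   ```python
   import signal

   # Hypothetical marker key; any unique XCom key would do.
   WORKER_KILLED_XCOM_KEY = "worker_killed"


   def make_sigterm_handler(task_instance):
       """Build a SIGTERM handler that records an 'infra kill' marker
       instead of calling operator.on_kill() (which would delete the pod)."""

       def handler(signum, frame):
           # Record that this was an external worker kill, so retry
           # logic can tell it apart from an ordinary task failure.
           task_instance.xcom_push(key=WORKER_KILLED_XCOM_KEY, value=True)
           # Exit without on_kill(): the pod keeps running, so the
           # retry can reattach to it via reattach_on_restart.
           raise SystemExit(128 + signum)

       return handler


   def is_eligible_to_retry(task_instance):
       """Simplified stand-in for the patched TaskInstance method:
       retry only when the worker-killed marker XCom is present."""
       if task_instance.try_number > task_instance.max_tries:
           return False
       return bool(task_instance.xcom_pull(key=WORKER_KILLED_XCOM_KEY))


   # Installation would look roughly like:
   #   signal.signal(signal.SIGTERM, make_sigterm_handler(ti))
   ```

   As the next paragraph explains, the weakness of this approach is that the handler cannot distinguish a SIGTERM caused by infrastructure from one caused by a user action in the UI.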
   
   Unfortunately, this is not a perfect solution, because clearing or 
stopping a task via the UI triggers the same signal handler as an external 
worker kill. As a result, stopping the task no longer kills the pod, and 
clearing the task causes a reattach when we would ideally like a restart.
   
   
   ### Use case/motivation
   
   Since the pod itself may fail for a valid reason, we don't want to simply 
add more retries. In that case the retry would not reattach anyway: it would 
start a completely new pod, since the original pod would already have been 
cleaned up.
   
   We specifically want reattaching to happen when the worker dies for 
infrastructure-related reasons. This is useful, for instance, during deployment 
updates in Kubernetes. Deployments are currently quite disruptive because all 
the running pods are first killed, and if retries are not enabled (for the 
reasons mentioned above), we have to restart all of them again (potentially 
losing all progress on any expensive operations that were running pre-deployment).
   
   ### Related issues
   
   _No response_
   
   ### Are you willing to submit a PR?
   
   - [ ] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [X] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

