sean-rose opened a new issue, #49466:
URL: https://github.com/apache/airflow/issues/49466

   ### Apache Airflow version
   
   2.10.5
   
   ### If "Other Airflow 2 version" selected, which one?
   
   _No response_
   
   ### What happened?
   
   When using the `KubernetesPodOperator` with `on_finish_action` set to 
`"keep_pod"`, if the task fails while its pod is still running (e.g. the 
task's `execution_timeout` is exceeded, or the pod takes longer than 
`startup_timeout_seconds` to start), the pod is left running indefinitely.
   
   ### What you think should happen instead?
   
   IMO, if a task fails, its associated workload should always be stopped, 
not left running. Failing that, it should at least be easy to configure the 
`KubernetesPodOperator` to clean up still-running pods without deleting all 
pods.
   
   If Kubernetes provided a way to stop running pods without entirely deleting 
them that'd be the ideal solution, but unfortunately that doesn't appear to be 
possible, so I see a few other options:
   1. Automatically delete running pods during cleanup.
      * This would be my preference, as I don't think it'd ever be desirable to 
leave pods running when the associated task has failed (e.g. due to the task 
execution timeout).
   2. Allow the behavior to be configured, either via `on_finish_action` or a 
new parameter specific to cleaning up running pods.
      * Though if we only add new `on_finish_action` options, users of the 
existing `"keep_pod"` and `"delete_succeeded_pod"` options would be left in 
the current buggy state, where running pods can remain after cleanup.
   3. Leave it up to users to implement their own pod cleanup logic via 
`on_pod_cleanup` callbacks.
      * This isn't currently feasible since `on_pod_cleanup` callbacks aren't 
called for failed tasks, but I've submitted #49441 to change that behavior.
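   For option 3, the cleanup logic a user would supply could look roughly like 
the sketch below: delete the pod only if it is still in the `Running` phase. 
This is an illustration, not the provider's API -- the function name 
`delete_pod_if_running` is hypothetical, and the `pod`/`client` shapes are 
assumed to follow the kubernetes Python client (`V1Pod` / `CoreV1Api`):
   ```python
   # Hypothetical cleanup hook body for option 3. Assumes the callback
   # receives the pod object and an API client, per the kubernetes
   # Python client's conventions.
   def delete_pod_if_running(pod, client):
       """Delete `pod` via `client` only if it is still running."""
       if pod.status.phase == "Running":
           # Mirrors kubernetes.client.CoreV1Api.delete_namespaced_pod
           client.delete_namespaced_pod(
               name=pod.metadata.name,
               namespace=pod.metadata.namespace,
           )
           return True
       return False
   ```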
   
   I'm happy to submit a PR to implement any of these options, but would need 
guidance from the Kubernetes provider maintainers on what the preferred 
approach should be.
   
   ### How to reproduce
   
   Configure a `KubernetesPodOperator` with `on_finish_action="keep_pod"` and 
an `execution_timeout` shorter than its runtime:
   ```python
   KubernetesPodOperator(
       task_id='pod_task_timeout_test',
       name='pod-task-timeout-test',
       image='alpine',
       cmds=['/bin/sh'],
       # The 300s sleep far exceeds the 10s execution_timeout below.
       arguments=['-c', 'sleep 300'],
       execution_timeout=datetime.timedelta(seconds=10),
       on_finish_action="keep_pod",
       ...
   )
   ```
   
   When the task fails due to the execution timeout the pod will be left 
running.
   
   ### Operating System
   
   Debian GNU/Linux 12 (bookworm)
   
   ### Versions of Apache Airflow Providers
   
   ```
   apache-airflow-providers-cncf-kubernetes==10.1.0
   ```
   
   ### Deployment
   
   Official Apache Airflow Helm Chart
   
   ### Deployment details
   
   _No response_
   
   ### Anything else?
   
   _No response_
   
   ### Are you willing to submit PR?
   
   - [x] Yes I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of 
Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)
   

