sean-rose opened a new issue, #49466:
URL: https://github.com/apache/airflow/issues/49466
### Apache Airflow version
2.10.5
### If "Other Airflow 2 version" selected, which one?
_No response_
### What happened?
When using the `KubernetesPodOperator` with `on_finish_action` set to
`"keep_pod"`, if the task fails while the pod is still running (e.g. the
task's specified `execution_timeout` is exceeded, or the pod takes longer than
`startup_timeout_seconds` to start), the pod is left running.
### What you think should happen instead?
IMO, if a task fails, its associated workload should always be stopped, not
left running. Or, failing that, it should at least be possible to easily
configure the `KubernetesPodOperator` to clean up running pods without always
deleting all pods.
If Kubernetes provided a way to stop running pods without entirely deleting
them that'd be the ideal solution, but unfortunately that doesn't appear to be
possible, so I see a few other options:
1. Automatically delete running pods during cleanup.
   * This would be my preference, as I don't think it'd ever be desirable to
     leave pods running when the associated task has failed (e.g. due to the
     task execution timeout).
2. Allow the behavior to be configured, either via `on_finish_action` or a
   new parameter specific to cleaning up running pods.
   * Though if we just add other options for `on_finish_action` we'd be
     intentionally leaving people using the existing `"keep_pod"` and
     `"delete_succeeded_pod"` options in the existing buggy state where
     running pods can be left after cleanup.
3. Leave it up to users to implement their own pod cleanup logic via
   `on_pod_cleanup` callbacks.
   * This isn't currently feasible since `on_pod_cleanup` callbacks aren't
     called for failed tasks, but I've submitted #49441 to change that
     behavior.
I'm happy to submit a PR to implement any of these options, but would need
guidance from the Kubernetes provider maintainers on what the preferred
approach should be.
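To make option 1 concrete, here is a minimal sketch of the decision it implies. It is not the provider's actual cleanup code: the `OnFinishAction` enum below is a local stand-in that only mirrors the three `on_finish_action` values, and `should_delete_pod` and `UNFINISHED_PHASES` are hypothetical names used for illustration.

```python
from enum import Enum


class OnFinishAction(str, Enum):
    """Local stand-in mirroring the operator's on_finish_action values."""

    KEEP_POD = "keep_pod"
    DELETE_POD = "delete_pod"
    DELETE_SUCCEEDED_POD = "delete_succeeded_pod"


# Kubernetes pod phases in which the workload has not finished yet.
UNFINISHED_PHASES = {"Pending", "Running"}


def should_delete_pod(action: OnFinishAction, phase: str) -> bool:
    """Decide whether cleanup should delete the pod (hypothetical helper).

    Under option 1, a pod that is still pending or running is always
    deleted, whatever on_finish_action says. Finished pods keep the
    behavior the existing options describe.
    """
    if phase in UNFINISHED_PHASES:
        return True
    if action is OnFinishAction.DELETE_POD:
        return True
    if action is OnFinishAction.DELETE_SUCCEEDED_POD:
        return phase == "Succeeded"
    # KEEP_POD: finished pods are left in place for inspection.
    return False
```

With this rule, `"keep_pod"` still preserves pods that reached `Succeeded` or `Failed`, so logs and status stay available for debugging, while a pod that outlives its task is removed.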
### How to reproduce
Configure a `KubernetesPodOperator` with `on_finish_action="keep_pod"` and
an `execution_timeout` shorter than its runtime:
```python
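import datetime

from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

KubernetesPodOperator(
    task_id='pod_task_timeout_test',
    name='pod-task-timeout-test',
    image='alpine',
    cmds=['/bin/sh'],
    # The container sleeps far longer than the 10-second execution timeout.
    arguments=['-c', 'sleep 300'],
    execution_timeout=datetime.timedelta(seconds=10),
    on_finish_action="keep_pod",
    # ... (remaining arguments elided)
)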
```
When the task fails due to the execution timeout, the pod will be left
running.
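For illustration, the kind of user-side cleanup that option 3 would rely on might look roughly like the sketch below. It assumes `pod` exposes the usual `V1Pod` attributes (`metadata.name`, `metadata.namespace`, `status.phase`) and that `client` offers `delete_namespaced_pod`, as the Kubernetes Python client's `CoreV1Api` does. The helper name, the `RecordingClient` fake, and the `default` namespace are hypothetical, and how such a helper would be wired into an `on_pod_cleanup` callback is not shown.

```python
from types import SimpleNamespace


def delete_pod_if_unfinished(pod, client) -> bool:
    """Delete the pod if it is still pending or running (hypothetical helper).

    Returns True when a delete request was issued, False otherwise.
    """
    if pod.status.phase not in ("Pending", "Running"):
        return False
    client.delete_namespaced_pod(
        name=pod.metadata.name,
        namespace=pod.metadata.namespace,
    )
    return True


class RecordingClient:
    """Fake client for a dry run: records deletions, contacts no cluster."""

    def __init__(self):
        self.deleted = []

    def delete_namespaced_pod(self, name, namespace):
        self.deleted.append((namespace, name))


# Dry run against a stand-in for the pod left behind by the reproduction.
leftover_pod = SimpleNamespace(
    metadata=SimpleNamespace(name="pod-task-timeout-test", namespace="default"),
    status=SimpleNamespace(phase="Running"),
)
fake_client = RecordingClient()
was_deleted = delete_pod_if_unfinished(leftover_pod, fake_client)
```

Passing the client in as an argument keeps the helper independent of how the cluster connection is obtained, which also makes it easy to exercise without a cluster.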
### Operating System
Debian GNU/Linux 12 (bookworm)
### Versions of Apache Airflow Providers
```
apache-airflow-providers-cncf-kubernetes==10.1.0
```
### Deployment
Official Apache Airflow Helm Chart
### Deployment details
_No response_
### Anything else?
_No response_
### Are you willing to submit PR?
- [x] Yes I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of Conduct](https://github.com/apache/airflow/blob/main/CODE_OF_CONDUCT.md)