shahar1 commented on issue #39791: URL: https://github.com/apache/airflow/issues/39791#issuecomment-3361138216
> Hi [@shahar1](https://github.com/shahar1)! Thank you for taking the time to look into this, and I'm glad that some additional good has come out of this as well.
>
> Before I begin, please don't mind the wall of text. I promise it's a rather light read.
>
> I spun up the official Airflow Helm chart on my local machine to test this out, but unfortunately my pod still died unannounced. Here is my `values.yml`. The repo referenced for the Dag is public.
>
> After triggering the Dag and letting it run for a bit, I killed the scheduler pod (not forced). Below are logs from the scheduler pod - nothing too interesting IMO:
>
> Immediately, the operator's pod went into the Terminating state as well, but it kept ticking away for the duration configured in `termination_grace_period`. No logs related to termination appeared; the log stream simply ended.
>
> Now, here's the kicker: if you compare the operator pod's metadata immediately before and after killing the scheduler (worker), this is what changes:
>
> ```diff
>   metadata:
>     creationTimestamp: "2025-10-01T12:56:44Z"
> -   generation: 1
> +   deletionGracePeriodSeconds: 60
> +   deletionTimestamp: "2025-10-01T12:58:24Z"
> +   generation: 2
> ```
>
> Indeed, [this is part of the `on_kill` call inside the operator](https://github.com/apache/airflow/blob/3.1.0/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py#L1223). It actually sets `deletionGracePeriodSeconds` to the value of `termination_grace_period` from the operator kwargs, then kills the pod. (I believe this kwarg shouldn't be used for this, but I'll avoid delving into that now.)
>
> Anyway. What I did to solve this is subclass the KPO and override the `on_kill` method to only print a message and return. That appears to have fixed my issue.
> Detailed logs from the ["fixed" version of the Dag](https://github.com/ralichkov/dag-tests/blob/e6abbc8c8f96d918e94450a28d94a7fb48ed1b57/kpo.py#L5-L11):
>
> Of course, this "hotfix" breaks other expected functionality, like the pod being interrupted after "mark failed", etc.
>
> On my first few attempts it would actually not resume the task properly, leading to weird results (screenshot of the task state attached in the original comment). Whether this is something on my end or another bug, I don't know; I just wanted to put it out there.
>
> Overall, I don't see how this reattachment behavior is supposed to work in the real world if `on_kill` is always called when the worker dies - that always kills the pod. What is it supposed to reattach to? I don't understand how your breeze setup "works". Maybe that's how it was rolled out as well.
>
> What I would propose is the following: modify `_on_term()` to propagate the trapped signal or frame (or both?), then pass that down to `ti.task.on_kill()`. Then, KPO can make the decision to **not** kill the operating pod if `reattach_on_restart` is enabled.
>
> I will look into setting up breeze locally and see what I can do about this.
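The no-op `on_kill` workaround described in the quoted comment might look roughly like the sketch below. This is a minimal illustration, not the commenter's actual code: `SafeKubernetesPodOperator` is a hypothetical name, and the `try`/`except` fallback class exists only so the snippet runs without Airflow installed.

```python
import logging

log = logging.getLogger(__name__)

try:
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
except ImportError:
    # Minimal stand-in so this sketch is importable without Airflow.
    class KubernetesPodOperator:
        def on_kill(self) -> None:
            """The real operator deletes the pod here."""


class SafeKubernetesPodOperator(KubernetesPodOperator):
    """Hypothetical subclass: skip pod deletion when the task is killed."""

    def on_kill(self) -> None:
        # Intentionally do NOT call super().on_kill(), so the pod keeps
        # running and reattach_on_restart has something to reattach to.
        log.warning("on_kill called; leaving the pod running for reattach")
```

As the comment notes, this disables pod cleanup for *every* kill path, including "mark failed", so it is a workaround rather than a fix.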
Thanks for the detailed reproduction steps! So it does indeed seem like an issue with the `KubernetesExecutor` specifically. I hope that after configuring k8s to work with breeze natively it will be easier for me to reproduce it as well (if you want to work on a solution for that - do not wait for me) :)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
