shahar1 commented on issue #39791: URL: https://github.com/apache/airflow/issues/39791#issuecomment-3361138216
> Hi [@shahar1](https://github.com/shahar1)! Thank you for taking the time to look into this, and I'm glad that some additional good has come out of this as well.
>
> Before I begin, please don't mind the wall of text. I promise it's a rather light read.
>
> I spun up the official Airflow Helm chart on my local machine to test this out, but unfortunately my pod still died unannounced. Here is my `values.yml`. The repo referenced for the Dag is public.
>
> After triggering the Dag and letting it run for a bit, I killed the scheduler pod (not forced). Below are logs from the scheduler pod - nothing too interesting IMO:
>
> Immediately, the operator's pod went into the Terminating state as well, but it kept ticking away for the duration configured in `termination_grace_period`. No logs related to termination appeared; the log stream simply ended.
>
> Now, here's the kicker: if you compare the operator pod's metadata immediately before and after killing the scheduler (worker), this is what changes:
>
> ```diff
>   metadata:
>     creationTimestamp: "2025-10-01T12:56:44Z"
> -   generation: 1
> +   deletionGracePeriodSeconds: 60
> +   deletionTimestamp: "2025-10-01T12:58:24Z"
> +   generation: 2
> ```
>
> Indeed, [this is part of the `on_kill` call inside the operator](https://github.com/apache/airflow/blob/3.1.0/providers/cncf/kubernetes/src/airflow/providers/cncf/kubernetes/operators/pod.py#L1223). It actually sets `deletionGracePeriodSeconds` to the value of `termination_grace_period` from the operator kwargs, then kills the pod. (I believe this kwarg shouldn't be used for this, but I'll avoid delving into that now.)
>
> Anyway. What I did to solve this is subclass the KPO and override the `on_kill` method to only print a message and return. That appears to have fixed my issue.
> Detailed logs from the ["fixed" version of the Dag](https://github.com/ralichkov/dag-tests/blob/e6abbc8c8f96d918e94450a28d94a7fb48ed1b57/kpo.py#L5-L11):
>
> Of course, this "hotfix" breaks other expected functionality, like the pod being interrupted after "mark failed", etc.
>
> On my first few attempts it would actually not resume the task properly, leading to weird results (screenshot of the task state attached in the original comment). Whether this is something on my end or another bug, I don't know; I just wanted to put it out there.
>
> Overall, I don't see how this reattachment behavior is supposed to work in the real world if `on_kill` is always called when the worker dies - that always kills the pod. What is it supposed to reattach to? I don't understand how your breeze setup "works". Maybe that's how it was rolled out as well.
>
> What I would propose is the following: modify `_on_term()` to propagate the trapped signal or frame (or both?), then pass that down to `ti.task.on_kill()`. Then, KPO can make the decision to **not** kill the operating pod if `reattach_on_restart` is enabled.
>
> I will look into setting up breeze locally and see what I can do about this.
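The no-op `on_kill` workaround described in the quoted comment might look roughly like the sketch below. This is a minimal illustration, not the commenter's actual code: `SafeKubernetesPodOperator` is a hypothetical name, and the `try`/`except` fallback class exists only so the snippet runs without Airflow installed.

```python
import logging

log = logging.getLogger(__name__)

try:
    from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator
except ImportError:
    # Minimal stand-in so this sketch is importable without Airflow.
    class KubernetesPodOperator:
        def on_kill(self) -> None:
            """The real operator deletes the pod here."""


class SafeKubernetesPodOperator(KubernetesPodOperator):
    """Hypothetical subclass: skip pod deletion when the task is killed."""

    def on_kill(self) -> None:
        # Intentionally do NOT call super().on_kill(), so the pod keeps
        # running and reattach_on_restart has something to reattach to.
        log.warning("on_kill called; leaving the pod running for reattach")
```

As the comment notes, this disables pod cleanup for *every* kill path, including "mark failed", so it is a workaround rather than a fix.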
Thanks for the detailed reproduction steps! So it does indeed seem like an issue with the `KubernetesExecutor` specifically. I hope that after configuring k8s to work with breeze natively it will be easier for me to reproduce it as well (if you want to work on a solution for that - do not wait for me) :)

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
