repl-chris commented on PR #19769:
URL: https://github.com/apache/airflow/pull/19769#issuecomment-1113721608

   @ephraimbuddy we can reproduce this reliably in an isolated test 
environment, and would love to help test this fix so it can get merged. 
We're running with a Redis broker and a PostgreSQL results DB, on k8s with 
KEDA autoscaling. It happens under load during a scale-in event (when a worker 
is being shut down).
   
   We've spent a bit of time tracking it down and (at least in our case) it 
looks to be a problem in Celery (possibly 
https://github.com/celery/celery/issues/7266). Airflow hands the task to 
Celery and it just never executes and never makes it into the `celery_taskmeta` 
table. Airflow gets a Celery task_id back and assumes it's running, but polling 
Celery for updated status returns `celery_states.PENDING` indefinitely, because 
the task has vanished and that's Celery's default response for an unknown task. 
It seems like Celery's warm shutdown on SIGTERM sometimes loses tasks. We've run 
some tests with [this patch](https://paste.ee/p/YdNAl) and it seems to recover 
from this state nicely. I believe it's (almost exactly) what you had 
implemented in [pull 21556](https://github.com/apache/airflow/pull/21556).  
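
   To make the failure mode concrete, here is a rough sketch of the kind of 
check that can spot these vanished tasks. It is plain Python for illustration, 
not the patch linked above and not Airflow's code: `find_stalled_tasks`, 
`sent_at`, `states` and `timeout` are all made-up names, and the only real 
fact it relies on is that Celery reports `PENDING` for a task id it knows 
nothing about.

```python
from datetime import datetime, timedelta

# Celery reports this state both for queued tasks and for task ids it has
# no record of, so on its own it cannot distinguish "waiting" from "lost".
PENDING = "PENDING"


def find_stalled_tasks(sent_at, states, now, timeout):
    """Return keys of tasks that have stayed PENDING for longer than `timeout`.

    sent_at: dict of task key -> datetime the task was handed to Celery
             (hypothetical bookkeeping, assumed for this sketch)
    states:  dict of task key -> last state polled from Celery; a missing
             entry is treated as PENDING, mirroring Celery's default
    """
    return [
        key
        for key, sent in sent_at.items()
        if states.get(key, PENDING) == PENDING and now - sent > timeout
    ]


if __name__ == "__main__":
    now = datetime(2022, 4, 29, 12, 0, 0)
    sent_at = {
        "lost": now - timedelta(minutes=10),
        "fresh": now - timedelta(seconds=30),
        "done": now - timedelta(minutes=10),
    }
    states = {"lost": PENDING, "fresh": PENDING, "done": "SUCCESS"}
    print(find_stalled_tasks(sent_at, states, now, timedelta(minutes=5)))
```

   The timeout is the only signal available here, so it has to be long enough 
that a task legitimately waiting in the queue is not mistaken for a lost one.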
   
   I do have some concerns around `self.task`. In your implementation, when 
the task is set back to `SCHEDULED` it is never cleared out of `self.task`, so 
if the re-scheduled task happens to be picked up by a different scheduler 
instance it will just stay in there forever (and in `self.running`, come to 
think of it; such entries could accumulate over time and eat up all the open 
slots). I think it would be desirable to add a filter so each executor will 
only "re-schedule" tasks which it "owns". This way the executor can clean up 
its own internal structures at the same time, and any tasks which are not 
owned by a running executor will get picked up by the adoption code and then 
"re-scheduled" by the new owner. I tried taking a stab at that change but it 
didn't quite work and I haven't figured out why yet, so maybe there's 
something going on I don't fully understand. I'm going to keep poking around at 
it though. Let me know if I should open a new PR, or if you'd like to 
resurrect your existing one...
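
   Roughly what I have in mind for the ownership filter, as a toy sketch. This 
is not the change I attempted and not Airflow's executor: `ExecutorBookkeeping`, 
`track` and `reschedule_owned` are hypothetical names, and the `task` and 
`running` attributes only mirror the names used in the paragraph above.

```python
class ExecutorBookkeeping:
    """Toy stand-in for an executor's internal state (illustration only)."""

    def __init__(self):
        self.task = {}        # task key -> Celery task id (name as in the comment)
        self.running = set()  # task keys currently occupying a slot

    def track(self, key, celery_id):
        """Record a task this executor has sent to Celery."""
        self.task[key] = celery_id
        self.running.add(key)

    def reschedule_owned(self, stalled_keys):
        """Release only the stalled tasks this executor tracks.

        Returns (released, skipped). Released keys are removed from both
        internal structures so they cannot accumulate and hold slots;
        skipped keys belong to someone else and are left for adoption.
        """
        released, skipped = [], []
        for key in stalled_keys:
            if key in self.task:
                del self.task[key]
                self.running.discard(key)
                released.append(key)
            else:
                skipped.append(key)
        return released, skipped


if __name__ == "__main__":
    executor = ExecutorBookkeeping()
    executor.track("dag.task_a", "celery-id-1")
    executor.track("dag.task_b", "celery-id-2")
    print(executor.reschedule_owned(["dag.task_a", "dag.other"]))
    print(sorted(executor.running))
```

   In a real change, the released keys would be the ones set back to 
`SCHEDULED`, while the skipped ones would be left untouched for the adoption 
code to hand to a new owner.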


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
