andrewgodwin opened a new pull request #18152:
URL: https://github.com/apache/airflow/pull/18152


   There is a set of circumstances in which TaskInstances can get "stuck" in the 
QUEUED state when running under KubernetesExecutor: they claim to have a pod 
scheduled (and so are queued) but do not actually have one, and so sit there 
forever.
   
   It appears this happens occasionally with reschedule sensors and now more 
often with deferrable tasks, when the task instance defers/reschedules and then 
resumes before the old pod has vanished. It would also, I believe, happen when 
the Executor hard-exits with items still in its internal queues.
   
   The executor already had a method to clean up stuck queued tasks, but it 
only ran once, on executor start. I have modified it to be safe to run 
periodically (by teaching it not to touch things that the executor looked at 
recently), and then made it run at a regular interval (30 seconds by default).
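
   As a rough illustration of the idea (a minimal sketch, not the code in this 
PR), the periodic check could look something like the following. The class 
`StuckQueuedSweeper`, its methods, and the string task keys are all 
hypothetical names made up for this example; only the 30-second default and 
the "skip recently handled items" rule come from the description above.

```python
import time
from typing import Callable, Dict, Iterable, List, Optional, Set


class StuckQueuedSweeper:
    """Hypothetical helper; an illustration, not the KubernetesExecutor code."""

    def __init__(self, interval: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.interval = interval          # seconds between sweeps
        self.clock = clock                # injectable so it can be tested
        self.last_handled: Dict[str, float] = {}
        self.last_sweep: Optional[float] = None

    def mark_handled(self, task_key: str) -> None:
        """Record that the executor just looked at this task."""
        self.last_handled[task_key] = self.clock()

    def sweep(self, queued: Iterable[str], keys_with_pods: Set[str]) -> List[str]:
        """Return queued task keys that look stuck and could be reset.

        Does nothing if the previous sweep was less than `interval` ago.
        """
        now = self.clock()
        if self.last_sweep is not None and now - self.last_sweep < self.interval:
            return []
        self.last_sweep = now

        stuck = []
        for key in queued:
            if key in keys_with_pods:
                continue  # a pod really exists, so the task is not stuck
            handled_at = self.last_handled.get(key)
            if handled_at is not None and now - handled_at < self.interval:
                continue  # touched recently; a pod may still be on its way
            stuck.append(key)
        return stuck
```

   The "recently handled" guard is what makes it safe to call this repeatedly 
from the executor loop rather than only at startup: a task that was just queued 
is left alone until at least one full interval has passed.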
   
   This is not a perfect fix - the only real fix would be to have far more 
detailed state tracking as part of TaskInstance or another table, and 
re-architect the KubernetesExecutor. However, this should significantly reduce 
how often this happens, so it should do for now.



