Overbryd edited a comment on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-945785740
I will unsubscribe from this issue.
I have not run into it again since (Airflow 2.1.2).
But the following circumstances did make it pop up again:
* An Airflow DAG/task is scheduled (not queued yet).
* The Airflow DAG code is updated, but the new code contains an error, so
the scheduler cannot load it and any task that starts up exits immediately.
* Now a rare condition takes place: a task is scheduled, but not yet
executed.
* That task boots a container.
* The task exits immediately, because the container loads the
faulty code and crashes without bringing up the task at all.
* No failure is recorded on the task.
* The scheduler then thinks the task is queued, although it crashed
immediately (using KubernetesExecutor).
* This leads to queued slots filling up over time.
* Once all queued slots of a pool (or the default pool) are filled with
queued (but never executed, immediately crashing) tasks, the scheduler and the
whole system get stuck.
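The slot exhaustion described above can be illustrated with a toy model. This is only a sketch of the bookkeeping, not Airflow's scheduler code: `ToyPool`, `simulate`, and the `worker_reports_back` flag are hypothetical names made up for the illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPool:
    """Toy stand-in for a pool with a fixed number of slots (not Airflow code)."""
    slots: int
    queued: list = field(default_factory=list)

    def open_slots(self):
        return self.slots - len(self.queued)

    def try_queue(self, task_id):
        """Occupy a slot if one is free; return whether the task was queued."""
        if self.open_slots() <= 0:
            return False
        self.queued.append(task_id)
        return True

    def report_finished(self, task_id):
        """Free the slot. Only happens when a worker reports a final state."""
        self.queued.remove(task_id)


def simulate(slots, attempts, worker_reports_back):
    """Try to queue `attempts` tasks; return (tasks queued, open slots left)."""
    pool = ToyPool(slots=slots)
    queued_count = 0
    for i in range(attempts):
        task_id = f"task_{i}"
        if not pool.try_queue(task_id):
            continue  # pool is full, nothing new can be queued
        queued_count += 1
        if worker_reports_back:
            pool.report_finished(task_id)
        # Otherwise the container crashed while loading the faulty code and
        # never recorded a state, so the slot stays occupied.
    return queued_count, pool.open_slots()
```

With a healthy worker, `simulate(3, 10, True)` queues all ten tasks and ends with three open slots. With containers that crash silently, `simulate(3, 10, False)` queues only three tasks and ends with zero open slots, which is the stuck state.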
How do I prevent this issue?
I simply make sure the DAG code is 100% clean and loads both in the
scheduler and in the tasks that start up (using KubernetesExecutor).
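One way to check this before deploying is to try loading every DAG file and collect the failures. The sketch below is a generic, standard-library-only approximation, not something taken from this issue: `find_broken_dag_files` is a hypothetical helper, and it assumes the check runs in the same environment (same installed packages) as the scheduler and the task containers.

```python
import runpy
from pathlib import Path


def find_broken_dag_files(dag_folder):
    """Return {file path: error message} for every .py file that fails to load."""
    errors = {}
    for path in sorted(Path(dag_folder).rglob("*.py")):
        try:
            # Execute the file top to bottom in a fresh namespace, which is
            # roughly what happens when a DAG file is parsed.
            runpy.run_path(str(path), run_name="dag_load_check")
        except Exception as exc:
            # Any load-time error counts, including SyntaxError and
            # failed imports.
            errors[str(path)] = f"{type(exc).__name__}: {exc}"
    return errors
```

A CI step could call `find_broken_dag_files("dags/")` (the folder name is an assumption) and fail the build if the result is non-empty. Inside an Airflow environment, inspecting `DagBag(...).import_errors` is likely the closer equivalent of what the scheduler itself sees.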
How do I recover from this issue?
First, I fix the issue that prevents the DAG code from loading. Then I restart
the scheduler.