MatrixManAtYrService edited a comment on issue #13542: URL: https://github.com/apache/airflow/issues/13542#issuecomment-849031715
While trying to recreate this, I wrote a [stress test](https://github.com/MatrixManAtYrService/airflow-git-sync/blob/master/scheduler_stress.py) which I ran overnight on my local microk8s cluster (release: 2.0.1+beb8af5ac6c438c29e2c186145115fb1334a3735, configured like [this](https://github.com/MatrixManAtYrService/airflow-git-sync/blob/master/zsh.stdin)). I was hoping that it would be fully stuck by the time I woke. Instead, there were only two stuck tasks.

Deleting the scheduler pod and letting Kubernetes recreate it caused the two stuck tasks to complete. At about 1:00 PM I cleared the state of all previous tasks. For a little while, the scheduler managed to both backfill the cleared tasks and keep up with scheduled runs, but then something happened that caused most of the tasks to get stuck.

<img width="825" alt="Screen Shot 2021-05-26 at 9 58 19 PM" src="https://user-images.githubusercontent.com/5834582/119764748-34b73000-be6f-11eb-99b0-c481905db56b.png">

Things were still limping along after that, but I never again saw more than three tasks running at once. This time, restarting the scheduler pod did **not** remedy the situation; the scheduler just resumed its prior anemic state.

Here's a dump of the database and a snapshot of the scheduler logs taken right after a restart: [db_and_scheduler_logs.tar.gz](https://github.com/apache/airflow/files/6551081/db_and_scheduler_logs.tar.gz)
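For anyone trying to reproduce the "restart the scheduler" and "clear all task state" steps above, they were done with stock `kubectl` and Airflow CLI commands along these lines. This is a sketch under assumptions: the `component=scheduler` label, the `airflow` namespace, and the `my_dag` DAG id are placeholders for my local setup, so adjust them for yours.

```shell
# Delete the scheduler pod; the controlling Deployment/StatefulSet
# recreates it automatically. (Label and namespace are assumptions
# matching a typical Helm-chart install.)
kubectl -n airflow delete pod -l component=scheduler

# Clear the state of prior task instances for a DAG so the scheduler
# will backfill them again. (DAG id and date are placeholders.)
airflow tasks clear my_dag --start-date 2021-05-25 --yes
```

After the clear, the scheduler should pick the cleared task instances back up alongside newly scheduled runs, which is the point at which things got stuck for me.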
