Jorricks commented on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-861300533
I currently work for a company where we have 100+ DAGs running with approximately 1,000 tasks. We were running into a similar issue. I figured out there is a bug in the `celery_executor`, which I still want to fix and contribute myself.

Summary of the problem: at startup, the scheduler's `celery_executor` instance picks up everything left behind by 'dead' schedulers (your previous run). If you run a single scheduler, that is every TaskInstance in the Running, Queued or Scheduled state. Once it has verified that such a task is not actually running (which takes 10 minutes), it clears most of the references but forgets a crucial one, so the scheduler can NEVER start that task anymore. You can still start it via the webserver, because the webserver has its own `celery_executor` instance.

What we noticed:
- Many tasks were very slow to be scheduled, even though the workers were almost fully idle.
- The TaskInstances were stuck in the Queued or Scheduled state.
- Restarting the scheduler didn't help.

What you can do to verify whether you have the same issue:
- Stop the scheduler.
- Clear all TaskInstances that are Queued or Scheduled (one programmatic way is sketched below).
- Start the scheduler.

Our fix:
- Increase `parallelism` in `airflow.cfg` from 32 to 320 (see the config snippet at the end). This setting is what deadlocks your scheduler in that case.
- Increase the default pool size from 128 to 1000, for a speedup.
- For any task the scheduler can't run anymore, either follow the procedure above or kick-start it yourself by clicking the task instance and selecting "Ignore All Deps", "Ignore Task State", "Ignore Task Deps" and finally "Run".

Hope this helps anyone and saves you a couple of days of debugging :)
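For the "clear all TaskInstances that are Queued or Scheduled" step, this is a minimal sketch of one way to do it against the metadata database, assuming Airflow 2.x; the script name and the printed output are mine, and you can achieve the same thing by clearing the tasks in the UI. Stop the scheduler before running it.

```python
# clear_stuck_tis.py — hypothetical helper, not part of Airflow itself.
# Assumes Airflow 2.x and access to the metadata database via the
# configured sql_alchemy_conn. Run with the scheduler stopped.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import State

with create_session() as session:
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state.in_([State.QUEUED, State.SCHEDULED]))
        .all()
    )
    for ti in stuck:
        print(f"clearing {ti.dag_id}.{ti.task_id} ({ti.execution_date})")
        # Setting the state back to None lets the restarted scheduler
        # pick the task up again with a fresh executor instance.
        ti.state = None
    # create_session() commits on clean exit, persisting the changes.
```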

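And for completeness, the config change we made, sketched here assuming an Airflow 2.x `airflow.cfg` (the comments are mine):

```ini
# airflow.cfg — the value we ended up with
[core]
# Maximum number of task instances allowed to run concurrently
# across the whole installation; was 32, raised to 320.
parallelism = 320
```

The default pool is not resized in `airflow.cfg`; we changed it from the UI (Admin -> Pools), and if I recall the CLI correctly, `airflow pools set default_pool 1000 "default pool"` does the same.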