Jorricks edited a comment on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-861300533


   We were running into a similar issue, as we have 100+ DAGs and around 1000 
tasks.
   
   I figured out there is a bug in the `celery_executor` which I still want to 
fix myself and contribute.
   
   Summary of that problem:
   At the start of the scheduler, the celery_executor class instance of the 
scheduler picks up everything from 'dead' schedulers (your previous run). That 
is (if you run one scheduler) every TaskInstance in the Running, Queued or 
Scheduled state. Then, once it has verified that a task is not running (this 
takes 10 minutes), it clears most of its references to that task but forgets a 
crucial one, so the scheduler can NEVER start this task again. You can still 
start it via the webserver because that has its own celery_executor class 
instance.
   
   What we noticed:
   - Many tasks were very slow to be scheduled even though the workers were 
almost fully idle.
   - The TaskInstances were stuck on Queued or Scheduled.
   - Restarting the scheduler didn't work.
   - Once restarted (with debug logging enabled) you'd get a logging line 
indicating you have negative open slots: `[2021-06-14 14:07:31,932] 
{base_executor.py:152} DEBUG - -62 open slots`
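   
   The negative number in that log line is plain arithmetic if you assume the 
executor computes its open slots as `parallelism` minus the number of task 
instances it still believes are running. The sketch below is a hypothetical 
stand-alone helper, not Airflow code, and the figure of 94 tracked tasks is 
only inferred from the -62 in our log and the parallelism of 32:

```python
def open_slots(parallelism: int, tracked_running: int) -> int:
    """Slots the executor thinks it has left (hypothetical helper)."""
    return parallelism - tracked_running


# Assumption: 94 stale entries are still tracked while parallelism is 32.
slots = open_slots(parallelism=32, tracked_running=94)
print(f"{slots} open slots")  # prints "-62 open slots"
```

   As long as that result is zero or below, nothing new gets launched, no 
matter how idle the workers are.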
   
   What you can do to verify whether you have the same issue:
   - Stop the scheduler
   - Clear all TaskInstances that are Queued or Scheduled
   - Start the scheduler
   
   Our fix:
   - Increase the airflow.cfg parallelism -> from 32 to 1000. This setting is 
what could easily deadlock your scheduler after a restart, because the 
scheduler uses it to decide whether it can launch any new task. With 
parallelism at 32, having 50 tasks waiting in Scheduled at restart will 
deadlock your entire scheduler.
   - Increase the default pool size (for a speedup) -> from 128 to 1000
   - For any task that the scheduler can't run anymore, follow the procedure 
mentioned above or kick-start it yourself by clicking the task instance 
followed by "Ignore all deps", "Ignore Task states", "Ignore Task Deps" and 
finally "Run".
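   
   To illustrate why raising parallelism works around the problem, here is a 
minimal toy model. It assumes the stale entries are never released and that a 
heartbeat launches at most `parallelism - len(running)` tasks. `ToyExecutor` 
and its methods are made up for this illustration and are not part of Airflow:

```python
from typing import List


class ToyExecutor:
    """Hypothetical model of an executor with a fixed parallelism limit."""

    def __init__(self, parallelism: int, stale: int) -> None:
        self.parallelism = parallelism
        # Entries picked up from a previous scheduler that are never released.
        self.running = {f"stale-{i}" for i in range(stale)}

    def heartbeat(self, queued: List[str]) -> List[str]:
        """Launch as many queued tasks as the open slots allow."""
        open_slots = self.parallelism - len(self.running)
        launched = queued[:max(open_slots, 0)]
        self.running.update(launched)
        return launched


queued = [f"task-{i}" for i in range(10)]

# 50 stale entries against parallelism 32: nothing can be launched.
blocked = ToyExecutor(parallelism=32, stale=50).heartbeat(queued)
# The same 50 stale entries against parallelism 1000 leave plenty of room.
unblocked = ToyExecutor(parallelism=1000, stale=50).heartbeat(queued)

print(len(blocked), len(unblocked))
```

   Note that this only hides the stale entries behind a larger limit; the 
entries themselves stay around until the underlying bug is fixed.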
   
   Hope this helps someone and saves you a couple of days of debugging :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

