Jorricks commented on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-861300533
I currently work for a company where we have 100+ DAGs running with approximately 1,000 tasks. We were running into a similar issue. I figured out there is a bug in the `celery_executor`, which I still want to fix and contribute myself.

Summary of the problem: at startup, the scheduler's `celery_executor` instance picks up everything left behind by 'dead' schedulers (your previous run). If you run a single scheduler, that is every TaskInstance in the Running, Queued or Scheduled state. Once it has verified that such a task is not actually running (which takes 10 minutes), it clears most of the references but forgets a crucial one, so the scheduler can NEVER start that task anymore. You can still start it via the webserver, because the webserver has its own `celery_executor` instance.

What we noticed:
- Many tasks were very slow to be scheduled, even though the workers were almost fully idle.
- The TaskInstances were stuck in the Queued or Scheduled state.
- Restarting the scheduler didn't help.

What you can do to verify whether you have the same issue:
- Stop the scheduler.
- Clear all TaskInstances that are Queued or Scheduled (one programmatic way is sketched below).
- Start the scheduler.

Our fix:
- Increase `parallelism` in `airflow.cfg` from 32 to 320 (see the config snippet at the end). This setting is what deadlocks your scheduler in that case.
- Increase the default pool size from 128 to 1000, for a speedup.
- For any task the scheduler can't run anymore, either follow the procedure above or kick-start it yourself by clicking the task instance and selecting "Ignore All Deps", "Ignore Task State", "Ignore Task Deps" and finally "Run".

Hope this helps anyone and saves you a couple of days of debugging :)
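For the "clear all TaskInstances that are Queued or Scheduled" step, this is a minimal sketch of one way to do it against the metadata database, assuming Airflow 2.x; the script name and the printed output are mine, and you can achieve the same thing by clearing the tasks in the UI. Stop the scheduler before running it.

```python
# clear_stuck_tis.py — hypothetical helper, not part of Airflow itself.
# Assumes Airflow 2.x and access to the metadata database via the
# configured sql_alchemy_conn. Run with the scheduler stopped.
from airflow.models import TaskInstance
from airflow.utils.session import create_session
from airflow.utils.state import State

with create_session() as session:
    stuck = (
        session.query(TaskInstance)
        .filter(TaskInstance.state.in_([State.QUEUED, State.SCHEDULED]))
        .all()
    )
    for ti in stuck:
        print(f"clearing {ti.dag_id}.{ti.task_id} ({ti.execution_date})")
        # Setting the state back to None lets the restarted scheduler
        # pick the task up again with a fresh executor instance.
        ti.state = None
    # create_session() commits on clean exit, persisting the changes.
```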

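And for completeness, the config change we made, sketched here assuming an Airflow 2.x `airflow.cfg` (the comments are mine):

```ini
# airflow.cfg — the value we ended up with
[core]
# Maximum number of task instances allowed to run concurrently
# across the whole installation; was 32, raised to 320.
parallelism = 320
```

The default pool is not resized in `airflow.cfg`; we changed it from the UI (Admin -> Pools), and if I recall the CLI correctly, `airflow pools set default_pool 1000 "default pool"` does the same.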