Jorricks edited a comment on issue #13542:
URL: https://github.com/apache/airflow/issues/13542#issuecomment-861300533
We were running into a similar issue, as we have 100+ DAGs and around 1,000
tasks.
I figured out there is a bug in the `celery_executor`, which I still want to
fix myself and contribute.
Summary of that problem:
At startup, the scheduler's celery_executor class instance picks up everything
left behind by 'dead' schedulers (your previous run). If you run a single
scheduler, that is every TaskInstance in the Running, Queued or Scheduled
state. Once it has verified that a task is not running (which takes 10
minutes), it clears most of the references to it but forgets a crucial one, so
the scheduler can NEVER start that task again. You can still start it via the
webserver because that has its own celery_executor class instance.
What we noticed:
- Many tasks were very slow to be scheduled even though the workers were
almost fully idle.
- The TaskInstances were stuck on Queued or Scheduled.
- Restarting the scheduler didn't work.
- Once restarted (with debug logging enabled) you'd get a logging line like
this: `[2021-06-14 14:07:31,932] {base_executor.py:152} DEBUG - -62 open slots`
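The negative slot count in that log line fits a simple model in which the executor computes its open slots as the configured parallelism minus the number of task instances it still tracks as running. The sketch below only illustrates that arithmetic under this assumption; it is not Airflow's code. `open_slots` and `can_launch` are hypothetical helpers, and the figure of 94 tracked tasks is picked only because it reproduces the -62 from the log with a parallelism of 32.

```python
def open_slots(parallelism: int, tracked_running: int) -> int:
    """Simplified model: slots left = parallelism - tasks the executor believes are running."""
    return parallelism - tracked_running


def can_launch(parallelism: int, tracked_running: int) -> bool:
    """In this model, a new task can only be started while at least one slot is open."""
    return open_slots(parallelism, tracked_running) > 0


# Hypothetical numbers: 94 stale task references picked up after a restart.
stale_tasks = 94

slots_with_32 = open_slots(32, stale_tasks)      # negative: nothing can be launched
slots_with_1000 = open_slots(1000, stale_tasks)  # positive: new tasks can still start

print(slots_with_32, can_launch(32, stale_tasks))
print(slots_with_1000, can_launch(1000, stale_tasks))
```

As long as the stale references are never released, the count stays negative in this model, no matter how idle the workers are.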
What you can do to verify whether you have the same issue:
- Stop the scheduler
- Clear all TaskInstances that are Queued or Scheduled
- Start the scheduler
Our fix:
- Increase `parallelism` in airflow.cfg from 32 to 1000. The low value is what
could easily deadlock your scheduler after a restart, because the scheduler
uses this setting to decide whether it can launch any new task. If you had 50
tasks waiting in Scheduled, it would deadlock your entire scheduler.
- Increase the default pool size from 128 to 1000 (for a speedup).
- For any task that the scheduler can't run anymore, follow the procedure
mentioned above or kick-start it yourself by clicking the task instance
followed by "Ignore all deps", "Ignore Task states", "Ignore Task Deps" and
finally "Run".
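For reference, the parallelism change amounts to a single key in the `[core]` section of airflow.cfg. The sketch below parses such a fragment with Python's standard `configparser` and checks how much headroom is left if 50 tasks are stuck after a restart. The fragment is trimmed to that one key, `read_parallelism` is a hypothetical helper, and the headroom calculation is a simplification rather than how the scheduler actually accounts for slots.

```python
import configparser

# Fragment of airflow.cfg with the raised limit; only the parallelism key is shown.
AIRFLOW_CFG_FRAGMENT = """
[core]
parallelism = 1000
"""


def read_parallelism(cfg_text: str) -> int:
    """Parse an airflow.cfg-style string and return the [core] parallelism value."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.getint("core", "parallelism")


parallelism = read_parallelism(AIRFLOW_CFG_FRAGMENT)

# Hypothetical: 50 task instances left in Scheduled after a restart.
stuck_after_restart = 50
headroom = parallelism - stuck_after_restart

print(f"parallelism={parallelism}, headroom={headroom}")
```

With the value at 32, the same 50 stuck tasks would leave no headroom at all in this simplified view, which matches the deadlock described above.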
Hope this helps anyone and saves you a couple days of debugging :)