sgrzemski-ias edited a comment on issue #10790: URL: https://github.com/apache/airflow/issues/10790#issuecomment-690223705
@yuqian90

> I also tried to tweak these parameters. They don't seem to matter much as far as this error is concerned:
>
> ```
> parallelism = 1024
> dag_concurrency = 128
> max_threads = 8
> ```
>
> The way to reproduce this issue seems to be to create a DAG with a bunch of parallel `reschedule` sensors. And make the DAG slow to import. For example, like this. If we add a `time.sleep(30)` at the end to simulate the experience of slow-to-import DAGs, this error happens a lot for such sensors. You may also need to tweak the `dagbag_import_timeout` and `dag_file_processor_timeout` if adding the `sleep` causes dags to fail to import altogether.

Those parameters won't help you much. I was struggling to work around this issue and I believe I've found the right solution now. In my case the biggest hint while debugging was not the scheduler/worker logs but the Celery Flower web UI.

We have a setup of 3 Celery workers with 4 CPUs each. It often happened that Celery was running 8 or more Python `reschedule` sensors on one worker but 0 on the others, and that was exactly when the sensors started to fail. Two Celery settings are responsible for this behavior: `worker_concurrency`, with a default value of `16`, and `worker_autoscale`, with a default value of `16,12` (meaning the minimum number of Celery processes on a worker is 12, and it can be scaled up to 16). With the default values, Celery was configured to load up to 16 tasks (mainly `reschedule` sensors) onto one node. After setting `worker_concurrency` to match the CPU count and `worker_autoscale` to `4,2`, the problem is **literally gone**. Maybe that is another clue for @turbaszek.

I've been trying hard to set up a local Docker Compose file with scheduler, webserver, Flower, Postgres, and RabbitMQ as the Celery broker, but I was not able to replicate the issue either. I tried to start a worker container with limited CPU to imitate this situation, but I failed.
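For reference, the fix described above lands in the `[celery]` section of `airflow.cfg` (or the matching `AIRFLOW__CELERY__*` environment variables). A minimal sketch, assuming 4-CPU worker nodes like the setup above:

```ini
[celery]
# Match the number of CPUs on each worker node (4 in this setup)
worker_concurrency = 4
# "max,min" autoscaled Celery processes per worker; stops a single
# node from hoarding up to 16 reschedule sensors at once
worker_autoscale = 4,2
```

The same values can be set via `AIRFLOW__CELERY__WORKER_CONCURRENCY=4` and `AIRFLOW__CELERY__WORKER_AUTOSCALE=4,2` if you configure workers through the environment.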
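For anyone else trying to reproduce this locally, one way to throttle a worker container's CPU in a Compose file is the service-level `cpus` key. This is only a sketch; the service name, image, and command are placeholders, not taken from any official compose file:

```yaml
# Hypothetical docker-compose fragment for an oversubscribed worker
services:
  airflow-worker:
    image: apache/airflow
    command: celery worker
    # Cap the container at one CPU to imitate an overloaded node
    cpus: 1.0
    environment:
      # Deliberately leave concurrency far above the CPU cap
      AIRFLOW__CELERY__WORKER_CONCURRENCY: "16"
```

The idea is to recreate the "many sensors, few CPUs" imbalance on a single machine; as noted above, I was not able to make it fail this way myself.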
There are in fact tasks killed and shown as failed in Celery Flower, but not with the "killed externally" reason.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: [email protected]
