DuanWeiFan commented on issue #39717:
URL: https://github.com/apache/airflow/issues/39717#issuecomment-2233759113

   We had this issue after we upgraded to 2.9.2, but we found a fix yesterday.
   
   **TL;DR: Increasing `AIRFLOW__CELERY__WORKER_CONCURRENCY` from 16 to 64 fixed the issue.**
   
   Configuration with Celery Executor -
   ```
     Airflow - 2.9.2
       2 Schedulers
     Celery - 3.7.2
       3 Celery Workers
     Broker - Redis
     Number of DAGs we run - 200+ DAGs triggered at 01:00 UTC
   ```
   
   **The root cause for us was the parallelism configuration. We had `AIRFLOW__CORE__PARALLELISM = 128`, but we forgot to increase `AIRFLOW__CELERY__WORKER_CONCURRENCY`, which was still at its default of `16`.
   With 2 schedulers at 128 parallelism and 3 workers at 16 concurrency each, far more tasks were being queued than our Celery workers had the capacity to handle (2 * 128 vs. 3 * 16).**
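   For reference, this is roughly the environment-variable change that fixed it for us. It is a sketch rather than a drop-in config: the value `64` is what worked for our 3-worker setup and should be sized against your own parallelism and worker count.
   ```
   # Overall parallelism we already had configured
   AIRFLOW__CORE__PARALLELISM=128
   # Tasks each Celery worker may run concurrently (the default is 16)
   AIRFLOW__CELERY__WORKER_CONCURRENCY=64
   # With 3 workers: 3 * 64 = 192 slots, enough to keep up with the schedulers
   ```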
   
   **We also set up the Celery Flower UI to help debug this issue -**
   <img width="2553" alt="image-2024-07-16-13-05-54-367" src="https://github.com/user-attachments/assets/0ba37b40-181a-4265-99b0-a681d1340762">
   There were two runs for this job; the second entry is the one that resulted in the error message. We found that the task was only received by a Celery worker at 01:13:29 UTC even though the job was scheduled for 01:00 UTC, and it then completed within 4.93 seconds.
   The scheduler marked the task as failed (up_for_retry) at 01:13:07 UTC (~ the default 600 sec timeout + 120 sec check interval), so our guess is that even though a Celery worker did eventually pick the task up at 01:13:29 UTC, it was revoked right away because the scheduler had already marked it as failed. From this we concluded that our workers simply did not have enough concurrency to pull tasks off the queue fast enough.
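   The timings above match the scheduler's stuck-in-queued handling. If we read the configuration reference correctly, these are the relevant knobs (defaults shown); raising the timeout would only hide the symptom, so we fixed the worker concurrency instead:
   ```
   # How long a task may sit in "queued" before the scheduler fails or requeues it (default 600 s)
   AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT=600
   # How often the scheduler runs that check (default 120 s)
   AIRFLOW__SCHEDULER__TASK_QUEUED_TIMEOUT_CHECK_INTERVAL=120
   ```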
   
   
   ----
   Although this looks like a configuration problem on our end regardless of Airflow version, we did not see this error on earlier Airflow versions (< 2.9.1), and we are not sure why it only appears on the latest one. Perhaps something changed in how the scheduler checks whether a task is stuck in the queued state?
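   One more thing that helped us confirm the backlog while debugging: checking the queue depth directly on the Redis broker. Assuming the queue name is Airflow's default `default` (configurable via `[operators] default_queue`) and `<redis-host>` is your broker host, something like this shows how many task messages are waiting:
   ```
   # Number of messages waiting in the Celery queue on the Redis broker
   redis-cli -h <redis-host> llen default
   ```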
   