Hi Guys,

Doing a month-long backfill for all the pipelines has brought up some issues which we may not have noticed before.

One of the issues I am seeing involves Airflow pools. From what I currently see in the UI, we have a pool named, say, "pool_1" which has "Queued Slots" = 30 and "Used Slots" = 5. The total number of available slots is 30. So this means that the next time the scheduler heartbeats, at least 25 tasks should be moved to occupy the "Unused Slots", right?

The heartbeats have been set very low:

    job_heartbeat_sec = 2
    scheduler_heartbeat_sec = 2

Originally, I had them both at 10 seconds, but I am getting irritated by how slow things have been. Strictly speaking, from a scheduler point of view, the scheduler should move jobs from "Queued" to "Running" (and occupy a "Used" slot) every 2 seconds (scheduler_heartbeat_sec).

These are the parallelism numbers I am using:

    parallelism = 64
    dag_concurrency = 64
    max_active_runs_per_dag = 16

I have not seen 64 tasks running at the same time yet, although I have seen around 40-50 in the "Queued" state. They just do not roll over to "Running" when the next heartbeat arrives.

There are around 10 hourly pipelines, each with around 15 tasks. The backfill is progressing at a pace of 600 tasks per hour. I would like to get this number to 60,000 per hour. I was hoping to complete the backfill within a day or two, but at this rate I think it is going to take a week.

I looked at the backend services: they are mostly sitting idle for minutes (sometimes 5 minutes) before they get a request.

I am not sure if my configurations are right. Has someone faced this before? Any suggestions for me? Currently, one of the bottlenecks I am observing is the time taken to move a task from the "Queued" to the "Used" stage (in the pool page of the UI).

Thanks,
Harish
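
For reference, the figures quoted above can be put into a rough estimate. This is only a back-of-the-envelope sketch in Python: it assumes "a month" means 30 days and that each hourly pipeline runs 24 times a day, and it treats the open slots in a pool as simply total minus used, which is how the pool page reads and not a statement about scheduler internals. The function names are made up for illustration.

```python
# Back-of-the-envelope numbers for the backfill described above.
# Assumptions (not from Airflow itself): a 30-day window, 24 runs per day
# per pipeline, and open pool slots computed as total minus used.

PIPELINES = 10        # "around 10 hourly pipelines"
TASKS_PER_RUN = 15    # "each with around 15 tasks"
DAYS = 30             # assumed length of "a month"
RUNS_PER_DAY = 24     # hourly schedule


def total_tasks(pipelines=PIPELINES, tasks_per_run=TASKS_PER_RUN,
                days=DAYS, runs_per_day=RUNS_PER_DAY):
    """Total task instances the backfill has to execute."""
    return pipelines * tasks_per_run * days * runs_per_day


def hours_to_finish(tasks, tasks_per_hour):
    """Wall-clock hours needed at a given throughput."""
    return tasks / tasks_per_hour


def open_slots(total_slots, used_slots):
    """Slots in a pool that are not currently occupied."""
    return max(total_slots - used_slots, 0)


if __name__ == "__main__":
    tasks = total_tasks()
    print("task instances to run:", tasks)
    print("days at 600 tasks/hour:", hours_to_finish(tasks, 600) / 24)
    print("hours at 60,000 tasks/hour:", hours_to_finish(tasks, 60000))
    # pool_1 as described: 30 total slots, 5 used, 30 tasks queued.
    free = open_slots(30, 5)
    print("tasks that could fill open slots:", min(30, free))
```

Under these assumptions the backfill is about 108,000 task instances, which at 600 tasks per hour is roughly 7.5 days and matches the "going to take a week" estimate. The 25 open slots in pool_1 are what one would expect to see filled if nothing else were limiting the scheduler.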
