Hi Guys,

Running a month-long backfill for all the pipelines has brought up some issues
which we may not have noticed before.

One of the issues I am seeing is:
We use Airflow pools.
From what I currently see in the UI, we have a pool named, say, "pool_1"
which has "Queued Slots" = 30
and "Used Slots" = 5.
Also, total available slots = 30.
So this means that the next time the scheduler heartbeats, at least 25 tasks
should be moved to occupy the unused slots, right?
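
To make that expectation concrete, here is a minimal sketch of the slot
arithmetic. The function name is hypothetical, and it assumes the pool is the
only limit in play, which is not necessarily how the scheduler decides:

```python
def expected_promotions(total_slots: int, used_slots: int, queued: int) -> int:
    """Tasks that could leave the queue if only pool slots limited them."""
    # Slots in the pool that are not currently occupied by running tasks.
    free_slots = max(total_slots - used_slots, 0)
    # We cannot promote more tasks than are queued, nor more than fit.
    return min(queued, free_slots)


# Numbers from the pool page: 30 total slots, 5 used, 30 queued.
print(expected_promotions(total_slots=30, used_slots=5, queued=30))  # 25
```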

The heartbeats have been set very low:
job_heartbeat_sec = 2
scheduler_heartbeat_sec = 2

Originally, I had them both at 10 sec, but I am kinda irritated by how
slow things have been.

Strictly speaking, from the scheduler's point of view, the scheduling should
move the jobs from "Queued" to "Running" (and occupy a "Used" slot) every
2 seconds (scheduler_heartbeat_sec).


These are the parallelism numbers I am using:

parallelism = 64
dag_concurrency = 64
max_active_runs_per_dag = 16

I have not seen 64 tasks running at the same time yet, although I have seen
around 40-50 in the "Queued" state. They just do not roll over to
"Running" when the next heartbeat arrives.
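
As a rough sketch of how these limits combine, the snippet below takes the
tightest of them as a ceiling on simultaneously running tasks. The function is
hypothetical, it ignores max_active_runs_per_dag and worker capacity, and the
second call assumes (purely for illustration) that every task is assigned to
"pool_1":

```python
from typing import Optional


def running_task_ceiling(
    parallelism: int,
    dag_concurrency: int,
    num_dags: int,
    pool_slots: Optional[int] = None,
) -> int:
    """Upper bound on concurrently running tasks under the given limits."""
    # parallelism caps the whole installation; dag_concurrency caps each DAG.
    ceiling = min(parallelism, dag_concurrency * num_dags)
    # If all tasks share one pool, its slot count is a further cap.
    if pool_slots is not None:
        ceiling = min(ceiling, pool_slots)
    return ceiling


print(running_task_ceiling(parallelism=64, dag_concurrency=64, num_dags=10))
print(running_task_ceiling(parallelism=64, dag_concurrency=64, num_dags=10,
                           pool_slots=30))
```

Under these assumptions the first call gives 64 and the second gives 30, so a
single shared 30-slot pool would by itself keep the running count below 64.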


There are around 10 hourly pipelines each with around 15 tasks.
It is progressing at a pace of 600 tasks per hour.
I would really like to get this number to 60,000/hour.
I was hoping to complete the backfill within a day or two, but I think
this is going to take a week.
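
A back-of-the-envelope check of that estimate, assuming a 30-day month and that
each hourly pipeline runs 24 times a day (both are assumptions, as are the
function names):

```python
def total_task_instances(pipelines: int, tasks_per_run: int,
                         runs_per_day: int, days: int) -> int:
    """Number of task instances the backfill has to execute."""
    return pipelines * tasks_per_run * runs_per_day * days


def backfill_hours(total_tasks: int, tasks_per_hour: float) -> float:
    """Wall-clock hours needed at a steady throughput."""
    return total_tasks / tasks_per_hour


# 10 hourly pipelines, ~15 tasks each, over an assumed 30-day month.
total = total_task_instances(pipelines=10, tasks_per_run=15,
                             runs_per_day=24, days=30)
print(total)                               # 108000 task instances
print(backfill_hours(total, 600) / 24)     # 7.5 days at the current pace
print(backfill_hours(total, 60_000))       # 1.8 hours at the target pace
```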


I looked at the backend services:
they mostly sit idle for minutes at a time (sometimes 5 minutes)
before they get a request.

I am not sure if my configurations are right.
Has someone faced this before? Any suggestions for me?

Currently, one of the bottlenecks I am observing is the time taken
to move a task from the "Queued" stage to the "Used" stage
(on the pool page of the UI).


Thanks,
Harish
