Hi Folks,

Previously I initiated a discussion about best practices for setting up Airflow, 
and a few folks agreed that the scheduler may become a bottleneck: we can only 
run one scheduler instance, and it can only be scaled vertically rather than 
horizontally, etc. Especially when we have thousands of DAGs, scheduling 
latency may be high.

In our team, we have experimented with a naive multiple-scheduler architecture. 
I would like to share it here, and also seek input from you.

*1. Background*
- Inside DAG_Folder, we can have sub-folders.
- When we start a scheduler instance, we can pass "--subdir" to specify the 
directory that the scheduler will scan 
(https://airflow.apache.org/cli.html#scheduler).
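For example (the path below is only a placeholder), a scheduler scoped to a 
single sub-folder can be started with:

    airflow scheduler --subdir /path/to/DAG_Folder/subdir1

This instance will only parse & schedule the DAG files under subdir1.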

*2. Our Naive Idea*
Say we have 2,000 DAGs. If we run one single scheduler instance, each 
scheduling loop has to traverse all 2,000 DAGs.

Our idea is:
Step-1: Create multiple sub-directories, say five, under DAG_Folder (subdir1, 
subdir2, …, subdir5).
Step-2: Distribute the DAGs evenly into these sub-directories (400 DAGs in 
each; see the sketch below).
Step-3: Start a scheduler instance on each of 5 machines, using the command 
`airflow scheduler --subdir subdir<i>` on machine <i>.

Hence each scheduler only needs to take care of 400 DAGs.
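A minimal sketch of Step-2, assuming a simple round-robin split (the paths and 
file layout are illustrative assumptions; any even split would do):

    # Spread the DAG files under DAG_Folder evenly into subdir1..subdir5
    cd /path/to/DAG_Folder        # placeholder path
    mkdir -p subdir1 subdir2 subdir3 subdir4 subdir5
    i=0
    for f in *.py; do
        mv "$f" "subdir$(( i % 5 + 1 ))/"
        i=$(( i + 1 ))
    done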

*3. Test & Results*
- We tested with 2,000 DAGs (3 tasks in each DAG).
- The DAGs are stored on network-attached storage (the same drive mounted on 
all nodes), so we don't need to worry about DAG_Folder synchronization.
- No conflict was observed (each DAG file is parsed & scheduled by exactly one 
scheduler instance).
- Scheduling speed improved almost linearly with the number of schedulers, 
demonstrating that the scheduler can be scaled horizontally this way.

*4. Highlight*
- This naive idea doesn't address scheduler availability: if one scheduler 
instance dies, the DAGs in its sub-directory are no longer scheduled until it 
is restarted.
- As Kelvin Yang shared earlier in another thread, the database may become 
another bottleneck when the load is high. That is not considered here yet.


Kindly share your thoughts on this naive idea. Thanks.



Best regards,
XD



