Since the 1.10.4 version we have massively re-worked and benchmarked
verify_integrity as part of the HA work, including against a dummy
sample of your large DAG structure provided by Kevin, and it is no
longer the bottleneck it once was. From memory, this was mostly fixed
around 1.10.12 by improving the queries issued.
We have run performance benchmarks of 1000 concurrent dags with 1000
tasks each, and verify_integrity barely showed up in the profile.
-ash
On Thu, Dec 16 2021 at 21:40:47 -0800, Ping Zhang <[email protected]>
wrote:
Hi Airflow community,
While reading the latest Airflow main branch, I noticed that dag run
creation, including the ti creation in verify_integrity, was moved
from the `DagFileProcessorManager` loop to the scheduling loop (in
_do_scheduling). I would like to learn more about the context behind
this.
In our production (Airbnb), we have a metric showing that
`verify_integrity` is very expensive for new dag runs: it can take
~47 seconds for a single dag run of one of our large dags (~20K tasks;
we have a few dozen dags reaching this size) on an AWS
db.r5.16xlarge. Even though we have optimized it down to ~17 seconds
(we will open source this soon), it is still very expensive.
This will greatly hurt the scheduling performance and lower the
overall throughput for large clusters. Creating dag runs for all
dags_needing_dagruns in the scheduling loop can exacerbate the
scheduling delay even if NUM_DAGS_PER_DAGRUN_QUERY is configurable.
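As a rough back-of-envelope illustration of the concern (a sketch only:
the function name and the batch size of 10 are assumptions made for
illustration, not Airflow code, and it assumes dag runs are created one
after another within a single scheduling loop iteration):

```python
def worst_case_creation_delay(seconds_per_dag_run, dag_runs_per_loop):
    """Upper bound on time one scheduling loop spends creating dag runs.

    Assumes every dag run in the batch is a large one and that creation
    is strictly serial; both are simplifying assumptions.
    """
    return seconds_per_dag_run * dag_runs_per_loop


# Hypothetical batch of 10 large dag runs picked up in one loop iteration.
BATCH_SIZE = 10

for label, cost in (("unoptimized", 47), ("optimized", 17)):
    delay = worst_case_creation_delay(cost, BATCH_SIZE)
    print(f"{label}: up to ~{delay} seconds before other work is scheduled")
```

Lowering the batch size shrinks this bound proportionally, but each
large dag run still blocks the loop for its full creation time.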
I would like to chat more about this.
Best wishes
Ping Zhang