yuqian90 opened a new pull request #11010:
URL: https://github.com/apache/airflow/pull/11010
This is cherry-picked from #4751 for `v1-10-test`.
Some investigation into the issue reported in #10790 led to the discovery
that this loop in `scheduler_job.py` takes almost 90% of the time in
`SchedulerJob.process_file()` for large DAGs (around 500 tasks). This causes
the `DagFileProcessor` spawned by the scheduler to go slowly. The reason this
loop is slow is that it creates a new `DepContext` for every `ti`. And every
`DepContext` needs to populate its own `finished_tasks` even though this list
is the same for every `DagRun`.
```python
for ti in tis:
task = dag.get_task(ti.task_id)
# fixme: ti.task is transient but needs to be set
ti.task = task
if ti.are_dependencies_met(
dep_context=DepContext(flag_upstream_failed=True),
session=session):
self.log.debug('Queuing task: %s', ti)
task_instances_list.append(ti.key)
```
This is the flamegraph generated by `py-spy` showing the performance of
`DagFileProcessor` in Airflow 1.10.12 before this PR:
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_before.svg
This is the performance after 1.10.12 is patched with this PR:
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_after.svg
The nice thing is that #4751 already addressed this issue for master branch.
We just need to cherry-pick it to fix this in 1.10.* with some very minor
conflict fixes.
While this PR will not fix every scenario that causes #10790, it does reduce
the `DagFileProcessor` time from around 100s to just about 12s for our use case
(a DAG with about 500 tasks, many of them are sensors in `reschedule` mode with
`poke_interval` 60s.).
Original commit message in #4751:
This decreases scheduler delay between tasks by about 20% for larger DAGs,
sometimes more for larger or more complex DAGs.
The delay between tasks can be a major issue, especially when we have dags
with
many subdags, figures out that the scheduling process spends plenty of time
in
dependency checking, we took the trigger rule dependency which calls the db
for
each task instance, we made it call the db just once for each dag_run
(cherry picked from commit 50efda5c69c1ddfaa869b408540182fb19f1a286)
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]