yuqian90 opened a new pull request #11010:
URL: https://github.com/apache/airflow/pull/11010


   This is cherry-picked from  #4751 for `v1-10-test`.
   
   Some investigation into the issue reported in #10790 led to the discovery 
that this loop in `scheduler_job.py` takes almost 90% of the time in 
`SchedulerJob.process_file()` for large DAGs (around 500 tasks). This causes 
the `DagFileProcessor` spawned by the scheduler to go slowly. The reason this 
loop is slow is that it creates a new `DepContext` for every `ti`. And every 
`DepContext` needs to populate its own `finished_tasks` even though this list 
is the same for every `DagRun`.
   
   ```python
               for ti in tis:
                   task = dag.get_task(ti.task_id)
   
                   # fixme: ti.task is transient but needs to be set
                   ti.task = task
   
                   if ti.are_dependencies_met(
                           dep_context=DepContext(flag_upstream_failed=True),
                           session=session):
                       self.log.debug('Queuing task: %s', ti)
                       task_instances_list.append(ti.key)
   ```
   
   This is the flamegraph generated by `py-spy` showing the performance of 
`DagFileProcessor` in Airflow 1.10.12 before this PR:
   
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_before.svg
   
   This is the performance after 1.10.12 is patched with this PR:
   
https://raw.githubusercontent.com/yuqian90/airflow/gif_for_demo/airflow/www/static/flamegraph_after.svg
   
   The nice thing is that #4751 already addressed this issue for master branch. 
We just need to cherry-pick it to fix this in 1.10.* with some very minor 
conflict fixes.
   
   While this PR will not fix every scenario that causes #10790, it does reduce 
the `DagFileProcessor` time from around 100s to just about 12s for our use case 
(a DAG with about 500 tasks, many of them are sensors in `reschedule` mode with 
`poke_interval` 60s.). 
   
   
   Original commit message in #4751:
   
   This decreases scheduler delay between tasks by about 20% for larger DAGs,
   sometimes more for larger or more complex DAGs.
   
   The delay between tasks can be a major issue, especially when we have dags 
with
   many subdags, figures out that the scheduling process spends plenty of time 
in
   dependency checking, we took the trigger rule dependency which calls the db 
for
   each task instance, we made it call the db just once for each dag_run
   
   (cherry picked from commit 50efda5c69c1ddfaa869b408540182fb19f1a286)
   


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to