[ https://issues.apache.org/jira/browse/AIRFLOW-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16167003#comment-16167003 ]
Gregory Benison edited comment on AIRFLOW-1614 at 9/14/17 9:44 PM:
-------------------------------------------------------------------

Some profiling suggests this is largely due to the cost of constructing DAG objects; in particular the call to "inspect.stack" in the constructor is rather expensive. It's also a bit unnecessary - we only need the caller's frame, but inspect.stack gives you the whole stack. I have a [pull request|https://github.com/apache/incubator-airflow/pull/2610] with one possible fix that cuts the parsing time of the above file by more than half on my platform.

was (Author: gbenison):
Some profiling suggests this is largely due to the cost of constructing DAG objects; in particular the call to "inspect.stack" in the constructor is rather expensive. It's also a bit unnecessary - we only need the caller's frame, but inspect.stack gives you the whole stack. I have a pull request with one possible fix (coming soon) that cuts the parsing time of the above file by more than half on my platform.

> Improve performance of DAG parsing when there are many subdags
> --------------------------------------------------------------
>
>                 Key: AIRFLOW-1614
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1614
>             Project: Apache Airflow
>          Issue Type: Improvement
>            Reporter: Gregory Benison
>
> DAGs can be very slow to parse when they contain many (100s or 1000s) of subdags.
> This can be illustrated using the following DAG definition file:
> {code}from datetime import datetime, timedelta
> from airflow.models import DAG
> from airflow.operators.dummy_operator import DummyOperator
> from airflow.operators.subdag_operator import SubDagOperator
>
> dag = DAG(
>     'subdaggy-2',
>     schedule_interval=None,
>     start_date=datetime(2017,1,1)
> )
>
> def make_sub_dag(parent_dag, N):
>     dag = DAG(
>         '%s.task_%d' % (parent_dag.dag_id, N),
>         schedule_interval=parent_dag.schedule_interval,
>         start_date=parent_dag.start_date
>     )
>     DummyOperator(task_id='task1', dag=dag) >> DummyOperator(task_id='task2', dag=dag)
>     return dag
>
> downstream_task = DummyOperator(task_id='downstream', dag=dag)
>
> for N in range(20):
>     SubDagOperator(
>         dag=dag,
>         task_id='task_%d' % N,
>         subdag=make_sub_dag(dag, N)
>     ) >> downstream_task
> {code}
> When there are more than 50 or so subdags this file becomes slow enough to parse that it fails to load in the web UI on a modest platform such as a laptop.
> It would be nice to support such DAGs, since there are useful workflows involving 100s or 1000s of subdags.

--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
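
For illustration only, here is a minimal sketch of the point made in the comment: "inspect.stack" builds a record for every frame on the stack, while the caller's frame can be reached by stepping back a single frame. The helper names (caller_via_stack, caller_via_frame, compare) are hypothetical and this is not the code from the pull request; it only assumes the standard library "inspect" and "timeit" modules.

```python
import inspect
import timeit


def caller_via_stack():
    # inspect.stack() builds a FrameInfo record for every frame on the
    # stack; only entry 1 (the caller) is used. Index 1 of that record
    # is the caller's filename.
    return inspect.stack()[1][1]


def caller_via_frame():
    # Step back exactly one frame from the current one instead of
    # materializing the whole stack.
    frame = inspect.currentframe()
    try:
        return frame.f_back.f_code.co_filename
    finally:
        del frame  # drop the local reference to avoid a reference cycle


def compare(n=200):
    # Time n calls of each helper; returns (stack_seconds, frame_seconds).
    # Absolute numbers depend on the platform and on stack depth.
    stack_seconds = timeit.timeit(caller_via_stack, number=n)
    frame_seconds = timeit.timeit(caller_via_frame, number=n)
    return stack_seconds, frame_seconds


def demo():
    # Both helpers are called from the same function, so they should
    # report the same calling file.
    return caller_via_stack(), caller_via_frame()
```

Calling compare() from a deeply nested call site should show the gap more clearly, since the cost of inspect.stack grows with the depth of the stack while the single-frame lookup does not.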