[ 
https://issues.apache.org/jira/browse/AIRFLOW-4173?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16803769#comment-16803769
 ] 

ASF GitHub Bot commented on AIRFLOW-4173:
-----------------------------------------

XD-DENG commented on pull request #4993: [AIRFLOW-4173] Improve scheduler 
performance by avoid unnecessary actions in SchedulerJob.process_file()
URL: https://github.com/apache/airflow/pull/4993
 
 
   ### Jira
   
     - https://issues.apache.org/jira/browse/AIRFLOW-4173
   
   ### Description
   
   In the current implementation of `SchedulerJob.process_file()` 
https://github.com/apache/airflow/blob/068ded96cd279dcd51f5b6d1e96f09205ecf40c8/airflow/jobs.py#L1722-L1734,
 `dag = dagbag.get_dag(dag_id)` is executed for every dag_id, even when that 
dag_id points to a paused DAG. The result is never used when the DAG is 
paused, so the call is wasted work.
   
   We can do the `if dag_id not in paused_dag_ids:` check first, before we 
invoke `dag = dagbag.get_dag(dag_id)`.
   
   This change may bring a considerable improvement: running `dag = 
dagbag.get_dag(dag_id)` for 1000 dag_ids takes ~8 seconds in my 
environment.
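
The reordering described above can be illustrated with a minimal, self-contained sketch. This is not the Airflow code: `FakeDagBag`, `collect_dags_lookup_first`, and `collect_dags_check_first` are hypothetical names made up for this example, and the stand-in `get_dag` only counts how often it is called so the saving is visible.

```python
class FakeDagBag:
    """Hypothetical stand-in for a DAG container; NOT Airflow's DagBag."""

    def __init__(self, dags):
        self._dags = dags
        self.get_dag_calls = 0  # counts lookups so the two variants can be compared

    def get_dag(self, dag_id):
        self.get_dag_calls += 1
        return self._dags[dag_id]


def collect_dags_lookup_first(dagbag, dag_ids, paused_dag_ids):
    """Pattern described as the current behaviour: look up, then check paused."""
    dags = []
    for dag_id in dag_ids:
        dag = dagbag.get_dag(dag_id)  # runs even for paused DAGs
        if dag_id not in paused_dag_ids:
            dags.append(dag)
    return dags


def collect_dags_check_first(dagbag, dag_ids, paused_dag_ids):
    """Proposed pattern: check paused first, look up only when needed."""
    dags = []
    for dag_id in dag_ids:
        if dag_id not in paused_dag_ids:
            dags.append(dagbag.get_dag(dag_id))
    return dags


if __name__ == "__main__":
    dag_ids = ["a", "b", "c"]
    paused = {"b"}  # a set keeps the membership test cheap

    bag = FakeDagBag({d: "DAG " + d for d in dag_ids})
    collect_dags_lookup_first(bag, dag_ids, paused)
    print("lookup first:", bag.get_dag_calls, "get_dag calls")

    bag = FakeDagBag({d: "DAG " + d for d in dag_ids})
    collect_dags_check_first(bag, dag_ids, paused)
    print("check first:", bag.get_dag_calls, "get_dag calls")
```

Both variants return the same list of unpaused DAGs; the second skips one lookup per paused DAG, so the saving grows with the number of paused DAGs.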
   
   
 


> Improve scheduler performance by avoid Unnecessary actions in 
> SchedulerJob.process_file()
> -----------------------------------------------------------------------------------------
>
>                 Key: AIRFLOW-4173
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4173
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>    Affects Versions: 1.10.2
>            Reporter: Xiaodong DENG
>            Assignee: Xiaodong DENG
>            Priority: Critical
>
> In the current implementation of SchedulerJob.process_file() 
> (https://github.com/apache/airflow/blob/068ded96cd279dcd51f5b6d1e96f09205ecf40c8/airflow/jobs.py#L1722-L1734),
> `dag = dagbag.get_dag(dag_id)` is executed for every dag_id, even when that 
> dag_id points to a paused DAG. The result is never used when the DAG is 
> paused, so the call is wasted work.
> We can check whether the DAG is paused first, before we invoke 
> `dag = dagbag.get_dag(dag_id)`. This may bring a considerable improvement.



