[ https://issues.apache.org/jira/browse/AIRFLOW-1139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15981718#comment-15981718 ]

Gerard Toonstra commented on AIRFLOW-1139:
------------------------------------------

Hi David,

That's because reprocessing of a DAG is tied to the scheduler cycle. DAGs can 
be dynamic, so you don't know which task instances a DAG will contain until 
its file is parsed again. What basically happens:

- the scheduler starts threads to process DAG files.
- each thread picks one DAG from the available DAGs to process.
- when a DAG file is parsed, all of its global-level code runs (this is what 
actually creates the tasks).
- if the schedule interval has passed, the processor creates a DagRun DB 
object and task instance DB objects, effectively scheduling the DagRun.
- it is the file-processing thread that does this, not the main scheduler 
loop itself.
- now that the database contains new DagRuns and new task instances to 
schedule, the main scheduler loop discovers them when it checks for task 
instances to run.
- they get sent to an executor.
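The steps above can be sketched in plain Python. This is a hypothetical 
illustration, not the real Airflow code: the point is that a DAG file's task 
set is only known after executing its top-level code, which is why the 
scheduler must keep re-parsing files.

```python
import datetime

def parse_dag_file(build_tasks):
    """Simulate re-parsing a DAG file: run its global-level code.

    The task list can change between parses, so the scheduler cannot
    cache it forever (hypothetical stand-in for the file processor).
    """
    return build_tasks()

def interval_passed(last_run, interval, now):
    """If the schedule interval has elapsed, a new DagRun would be created."""
    return (now - last_run) >= interval

# A 'dynamic' DAG: the number of tasks is only known when the file executes.
tasks = parse_dag_file(lambda: ['extract_%d' % i for i in range(3)])

due = interval_passed(datetime.datetime(2017, 4, 24),
                      datetime.timedelta(days=1),
                      datetime.datetime(2017, 4, 26))
```

Here `due` is True, so the file processor (not the main loop) would write the 
new DagRun and its task instances to the database for the scheduler to pick up.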

The min_file_process_interval setting is one way to manage this, but 
increasing it also increases the delay before a DAG is next analyzed for 
scheduling.

In your case it may be better to reduce max_threads, which defaults to 2 and 
controls the number of threads allocated to DAG file processors. That could 
mean a single thread is continuously analyzing DAG files, but you free up a 
thread for task execution.
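For reference, both knobs live in the [scheduler] section of airflow.cfg. The 
values below are illustrative, not recommendations:

```ini
[scheduler]
# Minimum number of seconds between re-parses of the same DAG file.
# Raising this reduces parsing load but delays scheduling of new DagRuns.
min_file_process_interval = 30

# Number of threads allocated to DAG file processors (defaults to 2);
# lowering it leaves more capacity for the rest of the scheduler.
max_threads = 1
```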

I'll raise this on the dev list with a link back here, so committers can 
verify my explanation; there may also be a smarter way to improve processing 
performance.


> Scheduler runs very slowly when many DAGs in DAG directory
> ----------------------------------------------------------
>
>                 Key: AIRFLOW-1139
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1139
>             Project: Apache Airflow
>          Issue Type: Improvement
>    Affects Versions: 1.8.0
>         Environment: macOS Sierra, v10.12.2, MacBook Pro, 2.5 GHz Intel Core 
> i7, 16 GB RAM
>            Reporter: David Vaughan
>            Priority: Minor
>              Labels: performance
>
> When we have several (10-15) DAGs in our DAG directory, and each of them is 
> pretty large (~900 tasks on average), Airflow's periodic re-processing of the 
> DAGs in our DAG directory takes a long time and takes resources away from 
> running DAGs.
> Almost always we only have one DAG actually running at any given time, and 
> the rest are paused. The one running DAG, however, crawls along noticeably 
> slower than if we only have one or two DAGs total in the DAG directory.
> I think it would be nice to have an option to turn off re-processing of DAGs 
> completely, after the initial processing.
> The way we use Airflow right now, we don't edit our existing DAGs frequently, 
> so we have no need for periodic refresh. We have experimented with the 
> min_file_process_interval option in airflow.cfg, but setting it to small 
> numbers causes no noticeable change, and setting it to very large numbers (to 
> emulate not refreshing at all) actually causes the DAG to run much slower 
> than it already was.
> Is anybody else still experiencing this? Are there existing ways to avoid 
> this problem? Here are some links to people referencing, I believe, this same 
> issue, but they're all from last year:
> https://issues.apache.org/jira/browse/AIRFLOW-160
> https://github.com/apache/incubator-airflow/pull/1636
> https://issues.apache.org/jira/browse/AIRFLOW-435
> http://stackoverflow.com/questions/40466732/apache-airflow-scheduler-slowness
> Thanks in advance for any discussion or help.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
