When we updated to Airflow 1.9, we noticed a fairly long delay between tasks (roughly 2 to 4 minutes, even after tuning the min_file_process_interval and max_threads configs). Our thinking was that if we reduce the number of files the scheduler has to process, it would process the files for unpaused DAGs more frequently, reducing the delay between tasks.
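To make the idea concrete, here is a minimal sketch of the filtering step. It is not the actual patch and does not use Airflow's own code. It assumes a dag table with the is_paused, is_active and fileloc columns mentioned in the proposal, and uses an in-memory SQLite database as a stand-in for the metadata DB. The function names, the skip_paused flag and the demo file paths are all hypothetical. One assumption goes beyond the proposal: a file is skipped only if every active DAG defined in it is paused, so a file that also holds an unpaused DAG is still processed.

```python
import sqlite3


def paused_dag_file_paths(conn):
    """Return the set of file paths whose active DAGs are all paused.

    Assumed schema: dag(dag_id, is_paused, is_active, fileloc).
    """
    rows = conn.execute(
        """
        SELECT fileloc
        FROM dag
        WHERE is_active = 1
        GROUP BY fileloc
        HAVING MIN(is_paused) = 1  -- no unpaused active DAG in this file
        """
    ).fetchall()
    return {row[0] for row in rows}


def filter_known_file_paths(known_file_paths, paused_paths, skip_paused=True):
    """Drop paused-only files from the list the scheduler would process.

    skip_paused stands in for the proposed scheduler config switch
    (hypothetical name). Order of the remaining paths is preserved.
    """
    if not skip_paused:
        return list(known_file_paths)
    return [path for path in known_file_paths if path not in paused_paths]


def make_demo_db():
    """Build an in-memory stand-in for the metadata DB with sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dag ("
        "dag_id TEXT PRIMARY KEY, is_paused INTEGER, "
        "is_active INTEGER, fileloc TEXT)"
    )
    conn.executemany(
        "INSERT INTO dag VALUES (?, ?, ?, ?)",
        [
            ("a", 0, 1, "/dags/a.py"),   # unpaused
            ("b", 1, 1, "/dags/b.py"),   # paused and active
            ("c1", 1, 1, "/dags/c.py"),  # paused, but shares a file...
            ("c2", 0, 1, "/dags/c.py"),  # ...with an unpaused DAG
            ("d", 1, 0, "/dags/d.py"),   # paused but inactive
        ],
    )
    return conn


if __name__ == "__main__":
    conn = make_demo_db()
    known = ["/dags/a.py", "/dags/b.py", "/dags/c.py", "/dags/d.py", "/dags/new.py"]
    print(filter_known_file_paths(known, paused_dag_file_paths(conn)))
```

Files with no row in the dag table yet (such as /dags/new.py above) are never excluded, so newly added DAG files would still be picked up.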

On 2017-11-27 11:23, Alek Storm <[email protected]> wrote:
> What's the advantage of this change? Performance?
>
> Alek
>
> On Mon, Nov 27, 2017 at 1:11 PM, [email protected] <
> [email protected]> wrote:
>
> > Hi all,
> >
> > I wanted to gauge community interest in this idea we have. We are
> > currently running a modified version of Airflow 1.9 RC3 where we ignore
> > processing DAG definition Python files for paused DAGs. By default,
> > list_py_file_paths traverses the dags subdirectory to look for Python
> > files, and the scheduler processes all these files, regardless of whether
> > the DAGs defined in these files are paused or not. Our proposed
> > modification was to query the fileloc column in the dag table, filtering
> > on is_paused=1 and is_active=1 to get a list of file paths for paused DAGs.
> > Then, we can exclude these files from the known_file_paths, so that the
> > scheduler does not process these files. This feature can be set on and off
> > via a scheduler config variable.
> >
> > If anyone is interested, we already have the code written, so we'd be
> > happy to package up our changes and create a PR.
> >
> > Thanks!
> > -Andy
> >
>
