Zhou Fang created AIRFLOW-4924:
----------------------------------

             Summary: Loading DAGs asynchronously in Airflow webserver
                 Key: AIRFLOW-4924
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-4924
             Project: Apache Airflow
          Issue Type: New Feature
          Components: webserver
    Affects Versions: 1.10.3, 1.10.2, 1.10.4
            Reporter: Zhou Fang
            Assignee: Zhou Fang
             Fix For: 1.10.4, 1.10.3, 1.10.2


h2. Scalability Issue in Webserver

Airflow webserver uses gunicorn workers to serve HTTP requests. It loads all 
DAGs from DAG files before serving requests. If there are many DAGs (e.g., > 
1,000), loading all DAGs takes a significant amount of time.

Airflow webserver also relies on restarting gunicorn workers to refresh all 
DAGs. The refresh interval is set by webserver-worker_refresh_interval, which 
defaults to 30s. As a result, if loading all DAGs takes more than 30s, the 
webserver will never be ready to serve HTTP requests.
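For reference, the relevant settings live in the [webserver] section of 
airflow.cfg (values shown are the defaults):

```ini
[webserver]
# Seconds to wait before refreshing a batch of gunicorn workers.
worker_refresh_interval = 30
# Number of workers restarted at a time during a refresh.
worker_refresh_batch_size = 1
```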

The current workaround is to skip loading DAGs via the env var 
SKIP_DAGS_PARSING. It keeps the webserver responsive, but no DAGs appear on the UI.
h2. Asynchronous DAG Loading

The solution here is to load DAGs asynchronously in the background. A 
background process loads DAGs, stringifies them, and sends them to the gunicorn 
worker processes. The stringifying step is needed because some fields cannot be 
pickled, e.g., locally defined functions and user-defined modules. It 
aggressively transforms all fields of DAGs and tasks to be string-compatible.
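A minimal sketch of the stringifying idea (the helper names here are 
hypothetical, not the patch's actual code): any attribute that fails to pickle 
is replaced by its string representation, so the whole object can cross the 
process boundary.

```python
import pickle


def make_string_compatible(value):
    """Return value unchanged if it is picklable; otherwise fall back
    to its string representation (e.g. for a locally defined function
    or an object from a user-defined module)."""
    try:
        pickle.dumps(value)
        return value
    except Exception:
        return str(value)


class Task:
    """Toy stand-in for an Airflow task; illustration only."""

    def __init__(self, task_id, on_failure_callback=None):
        self.task_id = task_id
        self.on_failure_callback = on_failure_callback


def stringify_task(task):
    """Aggressively convert every field of the task to be picklable."""
    for name, value in vars(task).items():
        setattr(task, name, make_string_compatible(value))
    return task
```

After this pass, the task can be pickled and sent through a queue or pipe to 
the gunicorn workers, at the cost of losing the original callables.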

This feature is enabled by webserver-async_dagbag_loader=True. The background 
process sends DAGs to the gunicorn workers gradually (every 
webserver-dagbag_sync_interval). The DAG refresh interval is controlled by 
webserver-collect_dags_interval.
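The background loop could be sketched roughly as follows (collect_dags, 
sync_dags, and the queue-based hand-off are illustrative assumptions, not the 
actual implementation): DAGs are pushed to workers one at a time, pausing 
dagbag_sync_interval between sends, and the whole collection is refreshed every 
collect_dags_interval.

```python
import queue
import threading
import time


def collect_dags():
    # Hypothetical stand-in for parsing the DAG folder and
    # stringifying the resulting DAGs.
    return {"example_dag": "<stringified DAG>"}


def sync_dags(out_queue, dagbag_sync_interval=0.0):
    """Send each collected DAG to the workers, pausing between DAGs so
    workers are updated gradually rather than all at once. Returns the
    number of DAGs sent."""
    dags = collect_dags()
    for dag_id, dag in dags.items():
        out_queue.put((dag_id, dag))
        time.sleep(dagbag_sync_interval)
    return len(dags)


def loader_loop(out_queue, collect_dags_interval, stop_event):
    """Background process body: re-collect and re-send all DAGs every
    collect_dags_interval seconds until asked to stop."""
    while not stop_event.is_set():
        sync_dags(out_queue)
        stop_event.wait(collect_dags_interval)
```

In the webserver, each gunicorn worker would drain the queue and merge the 
received DAGs into its in-memory DagBag, so the UI stays populated without 
worker restarts.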

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
