This is an automated email from the ASF dual-hosted git repository.
ephraimanierobi pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git
The following commit(s) were added to refs/heads/main by this push:
new 61bc8ffdff Update doc for DAG file processing (#23209)
61bc8ffdff is described below
commit 61bc8ffdffb18168cd5137fb1b8cf82709c2aa5a
Author: Ephraim Anierobi <[email protected]>
AuthorDate: Mon Apr 25 13:52:56 2022 +0100
Update doc for DAG file processing (#23209)
We can now run the ``DagFileProcessorProcess`` in a separate process and
it's not fully documented
Co-authored-by: Tzu-ping Chung <[email protected]>
---
.../apache-airflow/concepts/dagfile-processing.rst | 46 ++++++++++++++++++++++
docs/apache-airflow/concepts/index.rst | 1 +
docs/apache-airflow/concepts/scheduler.rst | 25 ++----------
3 files changed, 50 insertions(+), 22 deletions(-)
diff --git a/docs/apache-airflow/concepts/dagfile-processing.rst
b/docs/apache-airflow/concepts/dagfile-processing.rst
new file mode 100644
index 0000000000..676fd7af78
--- /dev/null
+++ b/docs/apache-airflow/concepts/dagfile-processing.rst
@@ -0,0 +1,46 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+ or more contributor license agreements. See the NOTICE file
+ distributed with this work for additional information
+ regarding copyright ownership. The ASF licenses this file
+ to you under the Apache License, Version 2.0 (the
+ "License"); you may not use this file except in compliance
+ with the License. You may obtain a copy of the License at
+
+ .. http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+ software distributed under the License is distributed on an
+ "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ KIND, either express or implied. See the License for the
+ specific language governing permissions and limitations
+ under the License.
+
+DAG File Processing
+-------------------
+
+DAG File Processing refers to the process of turning Python files contained in
the DAGs folder into DAG objects that contain tasks to be scheduled.
+
+There are two primary components involved in DAG file processing. The
``DagFileProcessorManager`` is a process executing an infinite loop that
determines which files need
+to be processed, and the ``DagFileProcessorProcess`` is a separate process
that is started to convert an individual file into one or more DAG objects.
+
+The ``DagFileProcessorManager`` runs user code. As a result, you can choose
to run it as a standalone process on a different host than the scheduler
process.
+If you decide to run it as a standalone process, you need to set the
configuration ``AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True`` and
+run the ``airflow dag-processor`` CLI command. Otherwise, starting the
scheduler process (``airflow scheduler``) also starts the
``DagFileProcessorManager``.
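For example, running the DAG processor as a standalone component might look like this (a minimal configuration sketch based on the setting and command named above; adapt it to your deployment):

```shell
# Enable the standalone DAG processor so the scheduler does not
# start the DagFileProcessorManager itself
export AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True

# Start the DAG file processor as its own process, e.g. on a
# different host than the scheduler
airflow dag-processor
```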
+
+.. image:: /img/dag_file_processing_diagram.png
+
+``DagFileProcessorManager`` has the following steps:
+
+1. Check for new files: If the elapsed time since the DAG was last refreshed
is > :ref:`config:scheduler__dag_dir_list_interval` then update the file paths
list
+2. Exclude recently processed files: Exclude files that have been processed
more recently than
:ref:`min_file_process_interval<config:scheduler__min_file_process_interval>`
and have not been modified
+3. Queue file paths: Add files discovered to the file path queue
+4. Process files: Start a new ``DagFileProcessorProcess`` for each file, up
to a maximum of :ref:`config:scheduler__parsing_processes`
+5. Collect results: Collect the result from any finished DAG processors
+6. Log statistics: Print statistics and emit
``dag_processing.total_parse_time``
+
+``DagFileProcessorProcess`` has the following steps:
+
+1. Process file: The entire process must complete within
:ref:`dag_file_processor_timeout<config:core__dag_file_processor_timeout>`
+2. Load the DAG file as a Python module: Must complete within
:ref:`dagbag_import_timeout<config:core__dagbag_import_timeout>`
+3. Process modules: Find DAG objects within the Python module
+4. Return DagBag: Provide the ``DagFileProcessorManager`` a list of the
discovered DAG objects
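Steps 2-4 amount to importing the file as a Python module and collecting the DAG objects defined at its top level. A rough sketch, not the actual ``DagFileProcessorProcess`` code: here any object exposing a ``dag_id`` attribute stands in for a real DAG, and timeouts are omitted.

```python
import importlib.util


def collect_dags(path):
    # Load the DAG file as a Python module (step 2)
    spec = importlib.util.spec_from_file_location("parsed_dag_file", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    # Scan the module's top-level names for DAG-like objects (step 3)
    # and return them to the caller (step 4)
    return [obj for obj in vars(module).values() if hasattr(obj, "dag_id")]
```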
diff --git a/docs/apache-airflow/concepts/index.rst
b/docs/apache-airflow/concepts/index.rst
index f4f0cb3b65..122dc760fe 100644
--- a/docs/apache-airflow/concepts/index.rst
+++ b/docs/apache-airflow/concepts/index.rst
@@ -43,6 +43,7 @@ Here you can find detailed documentation about each one of
Airflow's core concep
taskflow
../executor/index
scheduler
+ dagfile-processing
pools
timetable
priority-weight
diff --git a/docs/apache-airflow/concepts/scheduler.rst
b/docs/apache-airflow/concepts/scheduler.rst
index 420d464f11..0ee6724699 100644
--- a/docs/apache-airflow/concepts/scheduler.rst
+++ b/docs/apache-airflow/concepts/scheduler.rst
@@ -63,29 +63,10 @@ In the UI, it appears as if Airflow is running your tasks a
day **late**
DAG File Processing
-------------------
-The Airflow Scheduler is responsible for turning the Python files contained in
the DAGs folder into DAG objects that contain tasks to be scheduled.
-
-There are two primary components involved in DAG file processing. The
``DagFileProcessorManager`` is a process executing an infinite loop that
determines which files need
-to be processed, and the ``DagFileProcessorProcess`` is a separate process
that is started to convert an individual file into one or more DAG objects.
-
-.. image:: /img/dag_file_processing_diagram.png
-
-``DagFileProcessorManager`` has the following steps:
-
-1. Check for new files: If the elapsed time since the DAG was last refreshed
is > :ref:`config:scheduler__dag_dir_list_interval` then update the file paths
list
-2. Exclude recently processed files: Exclude files that have been processed
more recently than
:ref:`min_file_process_interval<config:scheduler__min_file_process_interval>`
and have not been modified
-3. Queue file paths: Add files discovered to the file path queue
-4. Process files: Start a new ``DagFileProcessorProcess`` for each file, up
to a maximum of :ref:`config:scheduler__parsing_processes`
-5. Collect results: Collect the result from any finished DAG processors
-6. Log statistics: Print statistics and emit
``dag_processing.total_parse_time``
-
-``DagFileProcessorProcess`` has the following steps:
-
-1. Process file: The entire process must complete within
:ref:`dag_file_processor_timeout<config:core__dag_file_processor_timeout>`
-2. Load modules from file: Uses Python imp command, must complete within
:ref:`dagbag_import_timeout<config:core__dagbag_import_timeout>`
-3. Process modules: Find DAG objects within Python module
-4. Return DagBag: Provide the ``DagFileProcessorManager`` a list of the
discovered DAG objects
+The Airflow Scheduler can be responsible for starting the process that turns
the Python files contained in the DAGs folder into DAG objects
+that contain tasks to be scheduled.
+Refer to :doc:`dagfile-processing` for details on how this can be achieved.
Triggering DAG with Future Date