This is an automated email from the ASF dual-hosted git repository.

ephraimanierobi pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/main by this push:
     new 61bc8ffdff Update doc for DAG file processing (#23209)
61bc8ffdff is described below

commit 61bc8ffdffb18168cd5137fb1b8cf82709c2aa5a
Author: Ephraim Anierobi <[email protected]>
AuthorDate: Mon Apr 25 13:52:56 2022 +0100

    Update doc for DAG file processing (#23209)
    
    We can now run the ``DagFileProcessorProcess`` in a separate process and it's not fully documented
    
    Co-authored-by: Tzu-ping Chung <[email protected]>
---
 .../apache-airflow/concepts/dagfile-processing.rst | 46 ++++++++++++++++++++++
 docs/apache-airflow/concepts/index.rst             |  1 +
 docs/apache-airflow/concepts/scheduler.rst         | 25 ++----------
 3 files changed, 50 insertions(+), 22 deletions(-)

diff --git a/docs/apache-airflow/concepts/dagfile-processing.rst b/docs/apache-airflow/concepts/dagfile-processing.rst
new file mode 100644
index 0000000000..676fd7af78
--- /dev/null
+++ b/docs/apache-airflow/concepts/dagfile-processing.rst
@@ -0,0 +1,46 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+DAG File Processing
+-------------------
+
+DAG File Processing refers to the process of turning Python files contained in the DAGs folder into DAG objects that contain tasks to be scheduled.
+
+There are two primary components involved in DAG file processing.  The ``DagFileProcessorManager`` is a process executing an infinite loop that determines which files need
+to be processed, and the ``DagFileProcessorProcess`` is a separate process that is started to convert an individual file into one or more DAG objects.
+
+The ``DagFileProcessorManager`` runs user code. As a result, you can decide to run it as a standalone
+process on a different host than the scheduler process. To do so, set the configuration option
+``AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True`` and run the ``airflow dag-processor`` CLI
+command; otherwise, starting the scheduler process (``airflow scheduler``) also starts the
+``DagFileProcessorManager``.
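For reference, here is the equivalent ``airflow.cfg`` fragment for the environment variable above (``[scheduler] standalone_dag_processor`` is the option that variable maps to):

```ini
[scheduler]
# Equivalent to AIRFLOW__SCHEDULER__STANDALONE_DAG_PROCESSOR=True;
# the DagFileProcessorManager must then be started separately with
# the `airflow dag-processor` command.
standalone_dag_processor = True
```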
+
+.. image:: /img/dag_file_processing_diagram.png
+
+``DagFileProcessorManager`` has the following steps:
+
+1. Check for new files: If the elapsed time since the DAG was last refreshed is > :ref:`config:scheduler__dag_dir_list_interval` then update the file paths list
+2. Exclude recently processed files: Exclude files that have been processed more recently than :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` and have not been modified
+3. Queue file paths: Add files discovered to the file path queue
+4. Process files: Start a new ``DagFileProcessorProcess`` for each file, up to a maximum of :ref:`config:scheduler__parsing_processes`
+5. Collect results: Collect the result from any finished DAG processors
+6. Log statistics: Print statistics and emit ``dag_processing.total_parse_time``
+
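As a rough illustration of steps 1-3 above, the file-selection logic can be sketched as follows; ``MIN_FILE_PROCESS_INTERVAL`` and the ``files_to_queue`` helper are hypothetical stand-ins, not Airflow's actual implementation:

```python
# Illustrative stand-in for the scheduler__min_file_process_interval
# config option (seconds).
MIN_FILE_PROCESS_INTERVAL = 30.0

def files_to_queue(paths, last_processed, last_modified, now):
    """Return the DAG files eligible for parsing: a file is skipped only
    if it was processed within MIN_FILE_PROCESS_INTERVAL seconds *and*
    has not been modified since it was last processed."""
    queue = []
    for path in paths:
        processed_at = last_processed.get(path, 0.0)
        recently_processed = now - processed_at < MIN_FILE_PROCESS_INTERVAL
        modified_since = last_modified.get(path, 0.0) > processed_at
        if not recently_processed or modified_since:
            queue.append(path)
    return queue
```

Each queued path would then be handed to a new ``DagFileProcessorProcess``, capped at the configured number of parsing processes.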
+``DagFileProcessorProcess`` has the following steps:
+
+1. Process file: The entire process must complete within :ref:`dag_file_processor_timeout<config:core__dag_file_processor_timeout>`
+2. Load the DAG file as a Python module: Must complete within :ref:`dagbag_import_timeout<config:core__dagbag_import_timeout>`
+3. Process modules: Find DAG objects within Python module
+4. Return DagBag: Provide the ``DagFileProcessorManager`` a list of the discovered DAG objects
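A minimal sketch of steps 2-3 using only the standard library; ``load_dag_file`` and ``collect_dags`` are hypothetical helpers, and a "DAG object" is approximated here as anything carrying a ``dag_id`` attribute (Airflow's real ``DagBag`` additionally enforces the import timeout and handles errors):

```python
import importlib.util

def load_dag_file(path):
    """Step 2: load a DAG file as a Python module."""
    spec = importlib.util.spec_from_file_location("dag_module", path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)  # executes the user's top-level code
    return module

def collect_dags(module):
    """Step 3: find DAG objects in the loaded module; here any object
    exposing a dag_id attribute counts as a DAG."""
    return [obj for obj in vars(module).values()
            if getattr(obj, "dag_id", None) is not None]
```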
diff --git a/docs/apache-airflow/concepts/index.rst b/docs/apache-airflow/concepts/index.rst
index f4f0cb3b65..122dc760fe 100644
--- a/docs/apache-airflow/concepts/index.rst
+++ b/docs/apache-airflow/concepts/index.rst
@@ -43,6 +43,7 @@ Here you can find detailed documentation about each one of Airflow's core concep
     taskflow
     ../executor/index
     scheduler
+    dagfile-processing
     pools
     timetable
     priority-weight
diff --git a/docs/apache-airflow/concepts/scheduler.rst b/docs/apache-airflow/concepts/scheduler.rst
index 420d464f11..0ee6724699 100644
--- a/docs/apache-airflow/concepts/scheduler.rst
+++ b/docs/apache-airflow/concepts/scheduler.rst
@@ -63,29 +63,10 @@ In the UI, it appears as if Airflow is running your tasks a day **late**
 DAG File Processing
 -------------------
 
-The Airflow Scheduler is responsible for turning the Python files contained in the DAGs folder into DAG objects that contain tasks to be scheduled.
-
-There are two primary components involved in DAG file processing.  The ``DagFileProcessorManager`` is a process executing an infinite loop that determines which files need
-to be processed, and the ``DagFileProcessorProcess`` is a separate process that is started to convert an individual file into one or more DAG objects.
-
-.. image:: /img/dag_file_processing_diagram.png
-
-``DagFileProcessorManager`` has the following steps:
-
-1. Check for new files:  If the elapsed time since the DAG was last refreshed is > :ref:`config:scheduler__dag_dir_list_interval` then update the file paths list
-2. Exclude recently processed files:  Exclude files that have been processed more recently than :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` and have not been modified
-3. Queue file paths: Add files discovered to the file path queue
-4. Process files:  Start a new ``DagFileProcessorProcess`` for each file, up to a maximum of :ref:`config:scheduler__parsing_processes`
-5. Collect results: Collect the result from any finished DAG processors
-6. Log statistics:  Print statistics and emit ``dag_processing.total_parse_time``
-
-``DagFileProcessorProcess`` has the following steps:
-
-1. Process file: The entire process must complete within :ref:`dag_file_processor_timeout<config:core__dag_file_processor_timeout>`
-2. Load modules from file: Uses Python imp command, must complete within :ref:`dagbag_import_timeout<config:core__dagbag_import_timeout>`
-3. Process modules:  Find DAG objects within Python module
-4. Return DagBag:  Provide the ``DagFileProcessorManager`` a list of the discovered DAG objects
+You can have the Airflow Scheduler be responsible for starting the process that turns the Python files
+contained in the DAGs folder into DAG objects that contain tasks to be scheduled.
 
+Refer to :doc:`dagfile-processing` for details on how this can be achieved.
 
 
 Triggering DAG with Future Date
