BasPH commented on a change in pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#discussion_r700266850
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
Review comment:
```suggestion
runs every time Airflow parses an eligible Python file, which happens at the minimum frequency of
```
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
+of such generated file from JSON or YAML, you can simply import objects generated in such Python files. That
+makes it easier to use such code from multiple DAGs without adding the extra code to find, load and parse
Review comment:
Would mention this approach first if it's the easier/preferred one.
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
Review comment:
Explain why the metadata should be published in the same folder as DAG
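For instance, a short snippet could make the point that keeping the metadata next to the DAG file lets it be loaded with a path relative to ``__file__``, with no extra lookup or deployment step (rough sketch only; ``all_tasks.json`` is a made-up file name):

```python
import json
import os

# Because the metadata ships in the same folder as this DAG file, it can be
# located relative to __file__ - no absolute paths or extra configuration.
DAG_DIR = os.path.dirname(os.path.abspath(__file__))

with open(os.path.join(DAG_DIR, "all_tasks.json")) as f:  # hypothetical metadata file
    ALL_TASKS = json.load(f)
```

It also means the metadata is deployed and versioned together with the DAG code.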
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
Review comment:
Could you add a list at the start to give an overview of the options before diving in? For example:
... There are generally two approaches for creating dynamic DAGs ...:
1. Generate a meta-file (e.g. YAML/JSON) which is interpreted by a Python script which generates tasks/DAGs
2. Generate a Python object (e.g. dict) from which tasks & DAGs are generated
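A rough sketch of the second option (all names below are made up for illustration) might look like:

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

# Hypothetical generated Python object describing the tasks to create.
TASK_COMMANDS = {
    "extract": "echo extract",
    "transform": "echo transform",
    "load": "echo load",
}

with DAG(dag_id="generated_from_dict", schedule_interval=None, start_date=days_ago(2)) as dag:
    previous = None
    for task_id, command in TASK_COMMANDS.items():
        task = BashOperator(task_id=task_id, bash_command=command)
        if previous:
            previous >> task  # chain the generated tasks in order
        previous = task
```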
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
Review comment:
Explain why you should avoid top level code
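For example, a short before/after contrast could make the reason explicit (just a sketch; ``expensive_query`` is a made-up helper standing in for database or network work):

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def expensive_query():
    """Hypothetical stand-in for a slow database or network call."""
    return ["row1", "row2"]


# BAD: this would run on every parse of the file, i.e. at least every
# min_file_process_interval seconds, on every component that parses DAG files.
# records = expensive_query()


def build_report():
    # GOOD: this only runs when the task itself is executed.
    records = expensive_query()
    print(len(records))


with DAG(dag_id="top_level_code_example", schedule_interval=None, start_date=days_ago(2)) as dag:
    PythonOperator(task_id="build_report", python_callable=build_report)
```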
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
Review comment:
Again, clarify "it"
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
+of such generated file from JSON or YAML, you can simply import objects generated in such Python files. That
+makes it easier to use such code from multiple DAGs without adding the extra code to find, load and parse
+the meta-data. This sounds strange, but it is surprisingly easy to generate such easy-to-parse and
+valid Python code that you can import from your DAGs.
+
+For example assume you dynamically generate (in your DAG folder), the ``my_company_utils/common.py`` file:
+
+.. code-block:: python
+
+    # This file is generated automatically !
+    ALL_TASKS = ["task1", "task2", "task3"]
+
+Then you should be able to import and use the ``ALL_TASK`` constant in all your DAGs like that:
Review comment:
```suggestion
Then you can import and use the ``ALL_TASKS`` constant in all your DAGs like that:
```
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
Review comment:
Explain why you should prepare the metadata externally?
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
+of such generated file from JSON or YAML, you can simply import objects generated in such Python files. That
+makes it easier to use such code from multiple DAGs without adding the extra code to find, load and parse
+the meta-data. This sounds strange, but it is surprisingly easy to generate such easy-to-parse and
+valid Python code that you can import from your DAGs.
+
+For example assume you dynamically generate (in your DAG folder), the ``my_company_utils/common.py`` file:
+
+.. code-block:: python
+
+    # This file is generated automatically !
+    ALL_TASKS = ["task1", "task2", "task3"]
+
+Then you should be able to import and use the ``ALL_TASK`` constant in all your DAGs like that:
+
+
+.. code-block:: python
+
+    from my_company_utils.common import ALL_TASKS
+
+    with DAG(dag_id="my_dag", schedule_interval=None, start_date=days_ago(2)) as dag:
+        for task in ALL_TASKS:
+            # create your operators and relations here
+            pass
+
+Don't forget that in this case you need to add empty ``__init__.py`` file in the ``my_company_utils`` folder
+and you should add the ``my_company_utils/.*`` line to ``.airflowignore`` file, so that the whole folder is
+ignored by the scheduler when it looks for DAGs.
+
+
+Triggering DAGs after changes
+-----------------------------
+
+Avoid triggering DAGs immediately after changing them or any other accompanying files that you change in the
+DAG folder.
+
+You should give the system sufficient time to process the changed files. This takes several steps.
+First the files have to be distributed to scheduler - usually via distributed filesystem or Git-Sync, then
+scheduler has to parse the python files and store them in the database. Depending on your configuration,
+speed of your distributed filesystem, number of files, number of DAGs, number of changes in the files,
+sizes of the files, number of schedulers, speed of CPUS, this can take from seconds to minutes, in extreme
+cases many minutes. You need to observe yours system to figure out the delays you can experience, you can
+also fine-tune that by adjusting Airflow configuration or increasing resources dedicated to Airflow
+components (CPUs, memory, I/O throughput etc.).
Review comment:
This sounds very complex and not very useful to a user IMO. How about
suggesting to wait for a change to appear in the UI? Also, could you mention
which configurations can be tuned in this situation?
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
Review comment:
Would replace "it" by "the metadata" here since the metadata is
mentioned 2 sentences up, takes some mental thought to understand the sentence.