BasPH commented on a change in pull request #17963:
URL: https://github.com/apache/airflow/pull/17963#discussion_r700266850
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
Review comment:
```suggestion
runs every time Airflow parses an eligible Python file, which happens at the minimum frequency of
```
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
+of such generated file from JSON or YAML, you can simply import objects generated in such Python files. That
+makes it easier to use such code from multiple DAGs without adding the extra code to find, load and parse
Review comment:
Would mention this approach first if it's the easier/preferred one.
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
Review comment:
Explain why the metadata should be published in the same folder as DAG
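For instance, a short snippet could make the point that keeping the metadata next to the DAG file lets it be loaded with a path relative to ``__file__``, with no extra lookup or deployment step (rough sketch only; ``all_tasks.json`` is a made-up file name):

```python
import json
import os

# Because the metadata ships in the same folder as this DAG file, it can be
# located relative to __file__ - no absolute paths or extra configuration.
DAG_DIR = os.path.dirname(os.path.abspath(__file__))

with open(os.path.join(DAG_DIR, "all_tasks.json")) as f:  # hypothetical metadata file
    ALL_TASKS = json.load(f)
```

It also means the metadata is deployed and versioned together with the DAG code.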
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
Review comment:
Could you add a list at the start to give an overview of the options before diving in? For example:
... There are generally two approaches for creating dynamic DAGs ...:
1. Generate a meta-file (e.g. YAML/JSON) which is interpreted by a Python script which generates tasks/DAGs
2. Generate a Python object (e.g. dict) from which tasks & DAGs are generated
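A rough sketch of the second option (all names below are made up for illustration) might look like:

```python
from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.utils.dates import days_ago

# Hypothetical generated Python object describing the tasks to create.
TASK_COMMANDS = {
    "extract": "echo extract",
    "transform": "echo transform",
    "load": "echo load",
}

with DAG(dag_id="generated_from_dict", schedule_interval=None, start_date=days_ago(2)) as dag:
    previous = None
    for task_id, command in TASK_COMMANDS.items():
        task = BashOperator(task_id=task_id, bash_command=command)
        if previous:
            previous >> task  # chain the generated tasks in order
        previous = task
```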
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
Review comment:
Explain why you should avoid top level code
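For example, a short before/after contrast could make the reason explicit (just a sketch; ``expensive_query`` is a made-up helper standing in for database or network work):

```python
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.dates import days_ago


def expensive_query():
    """Hypothetical stand-in for a slow database or network call."""
    return ["row1", "row2"]


# BAD: this would run on every parse of the file, i.e. at least every
# min_file_process_interval seconds, on every component that parses DAG files.
# records = expensive_query()


def build_report():
    # GOOD: this only runs when the task itself is executed.
    records = expensive_query()
    print(len(records))


with DAG(dag_id="top_level_code_example", schedule_interval=None, start_date=days_ago(2)) as dag:
    PythonOperator(task_id="build_report", python_callable=build_report)
```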
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
Review comment:
Again, clarify "it"
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
+of such generated file from JSON or YAML, you can simply import objects generated in such Python files. That
+makes it easier to use such code from multiple DAGs without adding the extra code to find, load and parse
+the meta-data. This sounds strange, but it is surprisingly easy to generate such easy-to-parse and
+valid Python code that you can import from your DAGs.
+
+For example assume you dynamically generate (in your DAG folder), the ``my_company_utils/common.py`` file:
+
+.. code-block:: python
+
+    # This file is generated automatically !
+    ALL_TASKS = ["task1", "task2", "task3"]
+
+Then you should be able to import and use the ``ALL_TASK`` constant in all your DAGs like that:
Review comment:
```suggestion
Then you can import and use the ``ALL_TASKS`` constant in all your DAGs like that:
```
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,47 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
Review comment:
Explain why you should prepare the metadata externally?
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
+If you do it, then you can read it from within the DAG by using constructs similar to
+``os.path.dirname(os.path.abspath(__file__))`` to get the directory where you can load your files
+from in such case.
+
+You can also generate directly Python code containing the meta-data. Then instead of reading content
+of such generated file from JSON or YAML, you can simply import objects generated in such Python files. That
+makes it easier to use such code from multiple DAGs without adding the extra code to find, load and parse
+the meta-data. This sounds strange, but it is surprisingly easy to generate such easy-to-parse and
+valid Python code that you can import from your DAGs.
+
+For example assume you dynamically generate (in your DAG folder), the ``my_company_utils/common.py`` file:
+
+.. code-block:: python
+
+    # This file is generated automatically !
+    ALL_TASKS = ["task1", "task2", "task3"]
+
+Then you should be able to import and use the ``ALL_TASK`` constant in all your DAGs like that:
+
+
+.. code-block:: python
+
+    from my_company_utils.common import ALL_TASKS
+
+    with DAG(dag_id="my_dag", schedule_interval=None, start_date=days_ago(2)) as dag:
+        for task in ALL_TASKS:
+            # create your operators and relations here
+            pass
+
+Don't forget that in this case you need to add empty ``__init__.py`` file in the ``my_company_utils`` folder
+and you should add the ``my_company_utils/.*`` line to ``.airflowignore`` file, so that the whole folder is
+ignored by the scheduler when it looks for DAGs.
+
+
+Triggering DAGs after changes
+-----------------------------
+
+Avoid triggering DAGs immediately after changing them or any other accompanying files that you change in the
+DAG folder.
+
+You should give the system sufficient time to process the changed files. This takes several steps.
+First the files have to be distributed to scheduler - usually via distributed filesystem or Git-Sync, then
+scheduler has to parse the python files and store them in the database. Depending on your configuration,
+speed of your distributed filesystem, number of files, number of DAGs, number of changes in the files,
+sizes of the files, number of schedulers, speed of CPUS, this can take from seconds to minutes, in extreme
+cases many minutes. You need to observe yours system to figure out the delays you can experience, you can
+also fine-tune that by adjusting Airflow configuration or increasing resources dedicated to Airflow
+components (CPUs, memory, I/O throughput etc.).
Review comment:
This sounds very complex and not very useful to a user IMO. How about
suggesting to wait for a change to appear in the UI? Also, could you mention
which configurations can be tuned in this situation?
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -124,10 +124,63 @@ Airflow parses the Python file. For more information, see: :ref:`managing_variab
 Top level Python Code
 ---------------------
-In general, you should not write any code outside of defining Airflow constructs like Operators. The code outside the
-tasks runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
+You should avoid writing the top level code which is not necessary to create Operators
+and build DAG relations between them. Specifically you should not run any database access,
+heavy computations and networking operations. The code outside the Operator's ``execute`` methods
+runs every time Airflow parses an eligible python file, which happens at the minimum frequency of
 :ref:`min_file_process_interval<config:scheduler__min_file_process_interval>` seconds.
+If you need to use some meta-data to prepare your DAG structure, you should prepare the meta-data externally.
+For example you can export from your database and publish together with the DAGs in a convenient file format
+(JSON, YAML formats are good candidates). Ideally it should be published in the same folder as DAG.
Review comment:
Would replace "it" by "the metadata" here since the metadata is
mentioned 2 sentences up, takes some mental thought to understand the sentence.