BasPH commented on code in PR #25121:
URL: https://github.com/apache/airflow/pull/25121#discussion_r923249662


##########
docs/apache-airflow/howto/dynamic-dag-generation.rst:
##########
@@ -140,3 +140,20 @@ Each of them can run separately with related configuration
 
 .. warning::
   Using this practice, pay attention to "late binding" behaviour in Python loops. See `that GitHub discussion <https://github.com/apache/airflow/discussions/21278#discussioncomment-2103559>`_ for more details
+
+
+Optimizing DAG parsing in workers/Kubernetes Pods
+-------------------------------------------------
+
+Sometimes when you generate a lot of Dynamic DAGs in single DAG file, it might cause unnecessary delays
+when the DAG file is parsed in worker or in Kubernetes POD. In Workers or Kubernetes PODs, you actually
+need only the single DAG (and even single Task of the DAG) to be instantiated in order to execute the task.
+If creating your DAG objects takes a lot of time, and each generated DAG is created independently from each
+other, this might be optimized away by simply skipping the generation of DAGs in worker.

Review Comment:
   A couple of suggestions for clarity/conciseness. I would also add a self-contained example, so that the reader can gather all the information from just the docs. Here is a proposed rewrite of this section:
   
   The parsing time of dynamically generated DAGs in Airflow workers can be optimized. This optimization is most effective when the number of generated DAGs is high. The Airflow scheduler requires the complete DAG file to be loaded in order to process all metadata. However, an Airflow worker requires only a single DAG object to execute a task. This allows us to skip the generation of unnecessary DAG objects in the worker, shortening the parsing time. When a DAG file is evaluated, command line arguments are supplied which we can use to determine whether the scheduler or a worker is evaluating the file:
   
   - Scheduler args: ``["scheduler"]``
   - Worker args: ``["airflow", "tasks", "run", "dag_id", "task_id", ...]``
   
   Upon iterating over the collection of things to generate DAGs for, use these arguments to determine whether you need to generate all DAG objects (when running in the scheduler), or to generate only a single DAG object (when running in a worker):
   
   .. code-block:: python
       :emphasize-lines: 5,6,7,11,12

       import sys

       from airflow import DAG

       current_dag = None
       if len(sys.argv) > 3:
           current_dag = sys.argv[3]

       for thing in list_of_things:  # the collection you generate DAGs from
           dag_id = f"generated_dag_{thing}"
           if current_dag is not None and current_dag != dag_id:
               continue  # skip generation of non-selected DAG

           dag = DAG(dag_id=dag_id, ...)
           globals()[dag_id] = dag
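
   Optionally (just a sketch, the exact guard is up to you): other ``airflow`` subcommands can also be invoked with more than three arguments, so it may be safer to only treat ``sys.argv[3]`` as a DAG id when the process is actually running ``tasks run``:

   .. code-block:: python

       import sys

       # Only trust argv[3] as a DAG id when the command line looks like
       # "airflow tasks run <dag_id> <task_id> ..."
       current_dag = None
       if len(sys.argv) > 3 and sys.argv[1:3] == ["tasks", "run"]:
           current_dag = sys.argv[3]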
   
   A nice example is shown in the [Airflow's Magic Loop](https://medium.com/apache-airflow/airflows-magic-loop-ec424b05b629) blog post, which describes how parsing in workers was reduced from 120 seconds to 200 ms.
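
   To sanity-check the behaviour locally (the module name, DAG id, task id and date below are hypothetical), you could simulate a worker's argv before importing the DAG file and check which DAG objects were actually created:

   .. code-block:: python

       import sys
       import time

       # Pretend to be a worker running one task of one generated DAG
       sys.argv = ["airflow", "tasks", "run", "generated_dag_b", "task_1", "2022-07-18"]

       start = time.monotonic()
       import my_generated_dags  # hypothetical module containing the loop above

       print(f"parsed in {time.monotonic() - start:.3f}s")
       print("DAGs created:", [n for n in vars(my_generated_dags) if n.startswith("generated_dag_")])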


