potiuk commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r712937017
##########
File path: docs/apache-airflow/best-practices.rst
##########
@@ -241,11 +243,51 @@ each parameter by following the links):
* :ref:`config:scheduler__parsing_processes`
* :ref:`config:scheduler__file_parsing_sort_mode`
+.. _best_practices/reducing_dag_complexity:
+
+Reducing DAG complexity
+^^^^^^^^^^^^^^^^^^^^^^^
+
+While Airflow is good at handling a lot of DAGs with a lot of tasks and dependencies between them, when
+you have many complex DAGs, their complexity might impact the performance of scheduling. To keep your
+Airflow instance performant and well utilized, you should strive to simplify and optimize your DAGs
+whenever possible - you have to remember that the DAG parsing process and creation is just executing
+Python code and it's up to you to make it as performant as possible. There are no magic recipes for
+making your DAG "less complex" - since it is Python code, it's the DAG writer who controls the
+complexity of their code.
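+
+For example, since parsing a DAG file is just executing Python code, you can time it directly. Below is
+a minimal sketch (the ``dags/`` folder path is an assumption - point ``DagBag`` at your own DAG folder):
+
+.. code-block:: python
+
+    import time
+
+    from airflow.models import DagBag
+
+    start = time.perf_counter()
+    # Parsing the folder is essentially the same work the scheduler's file processors do
+    dag_bag = DagBag(dag_folder="dags/", include_examples=False)
+    print(f"Parsed {len(dag_bag.dags)} DAGs in {time.perf_counter() - start:.2f}s")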
+
+There are no "metrics" for DAG complexity, especially, there are no metrics
that can tell you
+whether your DAG is "simple enough". However - as with any Python code you can
definitely tell that
+your code is "simpler" or "faster" when you optimize it, the same can be said
about DAG code. If you
+want to optimize your DAGs there are the following actions you can take:
+
+* Make your DAG load faster. This is a single piece of advice that might be implemented in various ways,
+  but it is the one that has the biggest impact on the scheduler's performance. Whenever you have a
+  chance to make your DAG load faster - go for it, if your goal is to improve performance. See
+  :ref:`best_practices/dag_loader_test` below on how to assess your DAG loading time; the first sketch
+  after this list shows a common way to achieve it.
+
+* Make your DAG generate fewer tasks. Every task adds additional processing overhead for scheduling and
+  execution. If you can decrease the number of tasks that your DAG uses, this will likely improve overall
+  scheduling performance. However, be aware that Airflow's flexibility comes from splitting the work
+  between multiple independent and sometimes parallel tasks, and that it is easier to reason about the
+  logic of your DAG when it is split into a number of independent, standalone tasks. Airflow also allows
+  you to re-run only specific tasks when needed, which might improve maintainability of the DAG - so you
+  have to strike the right balance between optimization, readability and maintainability that is best for
+  your team (the second sketch after this list illustrates this trade-off).
+
+* Make fewer DAGs per file. While Airflow 2 is optimized for the case of having multiple DAGs in one
+  file, there are some parts of the system that make it sometimes less performant, or introduce more
+  delays than having those DAGs split among many files. For example, the fact that one file can only be
+  parsed by one FileProcessor makes it less scalable. If you have many DAGs generated from one file,
+  consider splitting them if you observe processing and scheduling delays (the third sketch after this
+  list shows one way to do it).
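+
+The first sketch below shows a common way to make a DAG file load faster: avoid expensive work at the
+top level of the file and move it into the task callable instead. Here ``expensive_api_call`` is a
+hypothetical stand-in for any slow call or heavy computation:
+
+.. code-block:: python
+
+    import time
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.python import PythonOperator
+
+
+    def expensive_api_call():
+        # Hypothetical stand-in for a slow external call or heavy computation
+        time.sleep(10)
+        return "configuration"
+
+
+    # Avoid: top-level code like the line below would run on *every* parse of this file
+    # config = expensive_api_call()
+
+
+    def do_work():
+        # Better: the expensive call runs only when the task actually executes
+        config = expensive_api_call()
+        print(config)
+
+
+    with DAG(
+        dag_id="load_faster_example",
+        start_date=datetime(2021, 1, 1),
+        schedule_interval=None,
+    ) as dag:
+        PythonOperator(task_id="do_work", python_callable=do_work)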
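+
+The second sketch illustrates the trade-off around generating fewer tasks: processing a list of items in
+a single task instead of creating one task per item. The ``ITEMS`` list and the ``process`` function are
+hypothetical placeholders for your own work:
+
+.. code-block:: python
+
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.python import PythonOperator
+
+    ITEMS = [f"item_{i}" for i in range(100)]
+
+
+    def process(item):
+        # Hypothetical per-item work
+        print(f"processing {item}")
+
+
+    def process_all():
+        # A single task looping over the items is cheaper to schedule,
+        # at the cost of losing per-item visibility and re-runs
+        for item in ITEMS:
+            process(item)
+
+
+    with DAG(
+        dag_id="fewer_tasks_example",
+        start_date=datetime(2021, 1, 1),
+        schedule_interval=None,
+    ) as dag:
+        # One task per item would give per-item retries and re-runs, but
+        # would mean 100 tasks for the scheduler to handle:
+        # for item in ITEMS:
+        #     PythonOperator(task_id=f"process_{item}", python_callable=process, op_args=[item])
+
+        PythonOperator(task_id="process_all", python_callable=process_all)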
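+
+The third sketch shows one way to split many DAGs generated from a single file into one file per DAG by
+keeping the generation logic in a shared factory module. The module and file names are hypothetical:
+
+.. code-block:: python
+
+    # common/dag_factory.py - hypothetical shared module with the generation logic
+    from datetime import datetime
+
+    from airflow import DAG
+    from airflow.operators.bash import BashOperator
+
+
+    def create_dag(dag_id):
+        with DAG(
+            dag_id=dag_id,
+            start_date=datetime(2021, 1, 1),
+            schedule_interval=None,
+        ) as dag:
+            BashOperator(task_id="work", bash_command=f"echo {dag_id}")
+        return dag
+
+
+    # Instead of one file creating all DAGs in a loop (and being parsed by a
+    # single FileProcessor), each DAG file then only contains, for example:
+    #
+    # dags/team_a_dag.py:
+    #     from common.dag_factory import create_dag
+    #
+    #     dag = create_dag("team_a_dag")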
Review comment:
I clarified it a bit in the upcoming change (to make it clear that it applies to the case where you
see delays in changes propagating from the files to the UI)
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]