eladkal commented on a change in pull request #18356: URL: https://github.com/apache/airflow/pull/18356#discussion_r712259107
########## File path: docs/apache-airflow/concepts/scheduler.rst ##########
@@ -138,18 +141,172 @@ The following databases are fully supported and provide an "optimal" experience:

 Microsoft SQLServer has not been tested with HA.

+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing them with the DAG in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to take a number of factors into account:
+
+* The kind of deployment you have
+    * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+    * how fast the filesystem is (in many cases of distributed cloud filesystems you can pay extra to get
+      more throughput/a faster filesystem)
+    * how much memory you have for your processing
+    * how much CPU you have available
+    * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+    * how many DAG files you have
+    * how many DAGs you have in your files
+    * how large the DAG files are (remember the scheduler needs to read and parse the file every n seconds)
+    * how complex they are
+    * whether parsing your DAGs involves heavy processing (Hint! It should not. See :doc:`/best-practices`)
+
+* The scheduler configuration
+    * How many schedulers you have
+    * How many parsing processes you have in your scheduler
+    * How much time the scheduler waits between re-parsing the same DAG (it happens continuously)
+    * How many task instances the scheduler processes in one loop
+    * How many new DAG runs should be created/scheduled per loop
+    * Whether to execute the "mini-scheduler" after a completed task to speed up scheduling of dependent tasks
+    * How often the scheduler should perform cleanup and check for orphaned tasks and adopt them
+    * Whether the scheduler uses row-level locking
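For orientation, the scheduler configuration knobs listed above map to options in the ``[scheduler]`` section of ``airflow.cfg``. A minimal sketch with illustrative values (option names as of Airflow 2.x and may vary between versions; the numbers are examples, not recommendations):

```ini
[scheduler]
# number of parallel DAG-parsing processes
parsing_processes = 2
# minimum interval (seconds) before the same DAG file is re-parsed
min_file_process_interval = 30
# how often (seconds) the DAGs folder is scanned for new files
dag_dir_list_interval = 300
# how many task instances the scheduler processes per query/loop
max_tis_per_query = 512
# how many new DAG runs to create / schedule per loop
max_dagruns_to_create_per_loop = 10
max_dagruns_per_loop_to_schedule = 20
# run the "mini-scheduler" right after a task finishes
schedule_after_task_execution = True
# how often (seconds) to check for and adopt orphaned tasks
orphaned_tasks_check_interval = 300.0
# row-level locking, required when running several schedulers
use_row_level_locking = True
```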
+
+In order to perform fine-tuning, it's good to understand how the Scheduler works under the hood.
+You can take a look at the ``Airflow Summit 2021``
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ before you start fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine-tune the performance, but deciding which knobs to
+turn to get the best effect for you is a separate task, depending on your particular deployment, your
+DAG structure, hardware availability and expectations. Part of the job when managing the
+deployment is to decide what you are going to optimize for. Some users are fine with
+a 30-second delay in parsing new DAGs, at the expense of lower CPU usage, whereas other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs folder, at the
+expense of higher CPU usage.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimization (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* it's extremely important to monitor your system with the right set of tools. This document does not
+  go into the details of particular metrics and tools that you can use, it just describes what kind of
+  resources you should monitor, but you should follow your best practices for monitoring to grab the
+  right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are the usual limiting factors
+* based on your expectations and observations, decide what your next improvement is and go back to
+  observing your performance and bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. The Airflow Scheduler relies heavily on parsing (sometimes a lot of) Python
+  files, which are often located on a shared filesystem. The Airflow Scheduler continuously reads and
+  re-parses those files. The same files have to be made available to workers, so often they are
+  stored in a distributed filesystem. You can use various filesystems for that purpose (NFS, CIFS, EFS,
+  GCS fuse, Azure File System are good examples). There are various parameters you can control for those
+  filesystems to fine-tune their performance, but this is beyond the scope of this document. You should
+  observe the statistics and usage of your filesystem to determine if problems come from filesystem
+  performance. For example, there is anecdotal evidence that increasing IOPS (and paying more) for
+  EFS dramatically improves the stability and speed of parsing Airflow DAGs when EFS is used.
+* Another solution to FileSystem performance, if it becomes your bottleneck, is to turn to alternative
+  mechanisms of distributing your DAGs. Embedding DAGs in your image and GitSync distribution both have
+  the property that the files are available locally for the Scheduler, so it does not have to use a
+  distributed filesystem to read them; this is usually as fast as it can be, especially if your machines
+  use fast SSD disks for local storage. Those distribution mechanisms have other characteristics that
+  might make them not the best choice for you, but if your performance problems come from distributed
+  filesystem performance, they might be the best approach to follow.
+* Database connections and Database usage might become a problem as you want to increase performance and
+  process more things in parallel. Airflow is known for being "database-connection hungry" - the more DAGs
+  you have and the more you want to process in parallel, the more database connections will be opened.
+  This is generally not a problem for MySQL, as its model of handling connections is thread-based, but it
+  might be a problem for Postgres, where connection handling is process-based. The general consensus is
+  that if you have even a medium-size Postgres-based Airflow installation, the best solution is to use
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since the Scheduler triggers such parsing continuously, when you have a lot of complex DAGs,

Review comment:
   What is a complex DAG? Do we mean just a high number of tasks?
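The cost being asked about here is mostly paid at parse time. As a hedged illustration (the expensive call and all names below are hypothetical, not taken from the PR), keeping module level cheap and moving external calls into the task callable is what keeps FileProcessor CPU usage down:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

# BAD (illustrative): a call like the following at module level runs on every
# re-parse of this file by the scheduler's file processors, not only when the DAG runs:
#
#     rows = some_database_client.fetch("SELECT ...")   # hypothetical client
#
# BETTER: keep module level cheap and run the expensive call inside the task callable,
# so it only happens at task execution time.


def fetch_rows():
    # the expensive call happens here, at run time, on a worker
    ...


with DAG(
    dag_id="example_cheap_parse",   # hypothetical DAG, for illustration only
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    PythonOperator(task_id="fetch_rows", python_callable=fetch_rows)
```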
########## File path: docs/apache-airflow/concepts/scheduler.rst ##########
@@ -138,18 +141,172 @@ The following databases are fully supported and provide an "optimal" experience:
+  `PGBouncer <https://www.pgbouncer.org/>`_ as a proxy to your database. The :doc:`helm-chart:index`
+  supports PGBouncer out-of-the-box. For MsSQL we have not yet worked out the best practices as support
+  for MsSQL is still experimental.

Review comment:
   Is MsSQL experimental? I thought that starting 2.2 we consider it to be stable and fully supported.
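For reference, a minimal sketch of enabling PGBouncer through the official Helm chart's ``values.yaml`` (key names assumed from the chart's documented values and may differ between chart versions; the pool sizes are arbitrary examples, not recommendations):

```yaml
pgbouncer:
  enabled: true
  # limits on client connections and per-database pools; tune per deployment
  maxClientConn: 150
  metadataPoolSize: 10
  resultBackendPoolSize: 5
```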
########## File path: docs/apache-airflow/concepts/scheduler.rst ##########
@@ -138,18 +141,172 @@ The following databases are fully supported and provide an "optimal" experience:
+* CPU usage is most important for FileProcessors - those are the processes that parse and execute
+  Python DAG files. Since the Scheduler triggers such parsing continuously, when you have a lot of
+  complex DAGs, the processing might take a lot of CPU. You can mitigate it by increasing
+  :ref:`config:scheduler__min_file_process_interval`, but this is one of the trade-offs mentioned above:
+  changes to such files will be picked up more slowly and you will see delays between submitting the
+  files and having them available in the Airflow UI and executed by the Scheduler. Optimizing
+  the way your DAGs are built and avoiding external data sources is your best approach to improve CPU
+  usage. If you have more CPUs available, you can increase the number of parsing processes
+  :ref:`config:scheduler__parsing_processes`. Also, the Airflow Scheduler scales almost linearly with
+  several instances, so you can also add more Schedulers if your Scheduler's performance is CPU-bound.
+* Airflow might use a quite significant amount of memory when you try to get more performance out of it.
+  Often more performance is achieved in Airflow by increasing the number of processes handling the load,
+  and each process requires a whole Python interpreter loaded, a lot of classes imported, and temporary
+  in-memory storage. This can lead to memory pressure. You need to check that your system is not using
+  more memory than it has, which results in swapping to disk and dramatically decreases performance.
+  Note that the Airflow Scheduler in versions prior to ``2.1.4`` generated a lot of ``Page Cache`` memory
+  used by log files (when the log files were not removed). This was generally harmless, as the memory
+  is just cache and can be reclaimed at any time by the system; however, in version ``2.1.4`` and
+  beyond, writing logs will not generate excessive ``Page Cache`` memory. Regardless, when you look
+  at memory usage, make sure you pay attention to the kind of memory you are observing. Usually you
+  should look at ``working memory`` (names might vary depending on your deployment) rather than
+  ``total memory used``.
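A small sketch of the ``working memory`` vs ``total memory used`` distinction, assuming the third-party ``psutil`` package on Linux (this is a generic OS observation, not an Airflow API):

```python
import psutil

# On Linux, page cache is reclaimable, so "available" is usually a better signal
# of memory pressure than "total - used"; "cached" shows how much is just cache.
mem = psutil.virtual_memory()
print(f"total:     {mem.total / 2**30:.1f} GiB")
print(f"used:      {mem.used / 2**30:.1f} GiB")
print(f"cached:    {mem.cached / 2**30:.1f} GiB")      # Linux-only field
print(f"available: {mem.available / 2**30:.1f} GiB")   # what new processes can actually get
```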
+
+What can you do to improve Scheduler's performance
+"""""""""""""""""""""""""""""""""""""""""""""""""""
+
+When you know what your resource usage is, the improvements that you can consider might be:
+
+* improve the logic and efficiency of parsing and reduce the complexity of your DAG Python code. It is
+  parsed continuously, so optimizing that code might bring tremendous improvements, especially if you
+  try to reach out to external databases etc. while parsing DAGs (this should be avoided at all costs).
+  The :doc:`/best-practices` document shows a few examples of how you can approach dynamic DAG parsing
+  without reaching out to external sources.
+* improve utilization of your resources. When you have free capacity in your system that seems
+  underutilized (again CPU, memory, I/O and networking are the prime candidates), you can take actions
+  like increasing the number of schedulers or parsing processes, or decreasing intervals so that actions
+  happen more frequently; these might bring performance improvements at the expense of higher utilization
+  of those resources.

Review comment:
   Do we offer some way of identifying these? For example I use a test to check how much time it takes
   to load the dag and from that I deduce if an expensive call was made in operator init.
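The approach described in the comment can be written as a DagBag timing test. A minimal, hedged sketch (the ``dags/`` path next to the test file and the 2-second threshold are arbitrary assumptions, not a project convention):

```python
import time
from pathlib import Path

import pytest
from airflow.models.dagbag import DagBag

DAG_FOLDER = Path(__file__).parent / "dags"   # assumed location of the DAG files


@pytest.mark.parametrize("dag_file", sorted(DAG_FOLDER.glob("*.py")), ids=lambda p: p.name)
def test_dag_file_parses_quickly(dag_file):
    # parse a single DAG file in isolation and time it
    start = time.monotonic()
    dagbag = DagBag(dag_folder=str(dag_file), include_examples=False)
    elapsed = time.monotonic() - start

    assert not dagbag.import_errors, f"import errors in {dag_file}: {dagbag.import_errors}"
    # a slow import usually means an expensive call at parse time (e.g. in an operator's __init__)
    assert elapsed < 2.0, f"{dag_file.name} took {elapsed:.2f}s to parse"
```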
