potiuk commented on a change in pull request #18356: URL: https://github.com/apache/airflow/pull/18356#discussion_r712922936
########## File path: docs/apache-airflow/concepts/scheduler.rst ##########

@@ -138,18 +141,173 @@ The following databases are fully supported and provide an "optimal" experience:
 
 Microsoft SQLServer has not been tested with HA.
 
+
+Fine-tuning your Scheduler performance
+--------------------------------------
+
+What impacts scheduler's performance
+""""""""""""""""""""""""""""""""""""
+
+The Scheduler is responsible for two operations:
+
+* continuously parsing DAG files and synchronizing with the DAGs in the database
+* continuously scheduling tasks for execution
+
+Those two tasks are executed in parallel by the scheduler and run independently of each other in
+different processes. In order to fine-tune your scheduler, you need to take a number of factors
+into account:
+
+* The kind of deployment you have:
+
+  * what kind of filesystem you have to share the DAGs (impacts performance of continuously reading DAGs)
+  * how fast the filesystem is (in many cases of distributed cloud filesystems you can pay extra to get
+    more throughput/a faster filesystem)
+  * how much memory you have available for your processing
+  * how much CPU you have available
+  * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+
+  * how many DAG files you have
+  * how many DAGs you have in your files
+  * how large the DAG files are (remember the scheduler needs to read and parse the file every n seconds)
+  * how complex they are (i.e. how fast they can be parsed, how many tasks and dependencies they have)
+  * whether parsing your DAGs involves heavy processing (Hint! It should not.
+    See :doc:`/best-practices`)
+
+* The scheduler configuration:
+
+  * how many schedulers you have
+  * how many parsing processes you have in your scheduler
+  * how much time the scheduler waits between re-parsing of the same DAG (it happens continuously)
+  * how many task instances the scheduler processes in one loop
+  * how many new DAG runs should be created/scheduled per loop
+  * whether to execute the "mini-scheduler" after a completed task to speed up scheduling of dependent tasks
+  * how often the scheduler performs cleanup and checks for orphaned tasks to adopt
+  * whether the scheduler uses row-level locking
+
+In order to perform fine-tuning, it's good to understand how the Scheduler works under the hood.
+You can take a look at the Airflow Summit 2021
+`Deep Dive into the Airflow Scheduler talk <https://youtu.be/DYC4-xElccE>`_ before you start fine-tuning.
+
+How to approach Scheduler's fine-tuning
+"""""""""""""""""""""""""""""""""""""""
+
+Airflow gives you a lot of "knobs" to turn to fine-tune the performance, but it is a separate task,
+depending on your particular deployment, your DAG structure, hardware availability and expectations,
+to decide which knobs to turn to get the best effect for you. Part of the job when managing the
+deployment is to decide what you are going to optimize for. For example, some users are fine with
+30-second delays in new DAG parsing at the expense of lower CPU usage, whereas other users expect
+the DAGs to be parsed almost instantly when they appear in the DAGs folder, at the expense of
+higher CPU usage.
+
+Airflow gives you the flexibility to decide, but you should find out what aspect of performance is
+most important for you and decide which knobs you want to turn in which direction.
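+The knobs above map to options in the ``[scheduler]`` section of ``airflow.cfg``. As a sketch
+(option names as in Airflow 2.x; the values shown are only illustrative, not recommendations)::
+
+    [scheduler]
+    # Number of processes used to parse DAG files in parallel
+    parsing_processes = 2
+    # Minimum number of seconds between re-parsing the same DAG file
+    min_file_process_interval = 30
+    # How many task instances the scheduler processes per query in one loop
+    max_tis_per_query = 512
+    # How many new DAG runs may be created per scheduler loop
+    max_dagruns_to_create_per_loop = 10
+    # Run the "mini-scheduler" after each task finishes to schedule dependent tasks
+    schedule_after_task_execution = True
+    # How often (in seconds) to check for orphaned tasks and adopt them
+    orphaned_tasks_check_interval = 300.0
+    # Use database row-level locking (required for running multiple schedulers)
+    use_row_level_locking = True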
+
+Generally for fine-tuning, your approach should be the same as for any performance improvement and
+optimization (we will not recommend any specific tools - just use the tools that you usually use
+to observe and monitor your systems):
+
+* it is extremely important to monitor your system with the right set of tools. This document does
+  not go into the details of particular metrics and tools that you can use; it only describes what
+  kind of resources you should monitor. Follow your best practices for monitoring to gather the
+  right data.
+* decide which aspect of performance is most important for you (what you want to improve)
+* observe your system to see where your bottlenecks are: CPU, memory and I/O are the usual limiting factors
+* based on your expectations and observations, decide what your next improvement should be, then go
+  back to observing your performance and bottlenecks. Performance improvement is an iterative process.
+
+What resources might limit Scheduler's performance
+""""""""""""""""""""""""""""""""""""""""""""""""""
+
+There are several areas of resource usage that you should pay attention to:
+
+* FileSystem performance. Airflow Scheduler relies heavily on parsing (sometimes a lot) of Python

Review comment:

I think it does make a lot of difference (at least in the perception of performance). There is real
(mostly anecdotal, but I have seen it several times) evidence of how, for example, buying extra EFS
IOPS improved both the perceived performance and the stability of the scheduler. I've seen many
people who believe that by employing a distributed filesystem they "magically" get instantly
distributed files, which - especially with a cloud filesystem like EFS - is not at all true.
Also when working with Composer - they used GCS fuse under the hood and it caused a lot of trouble
and stability/perceived-performance issues before it was properly optimized and fine-tuned. Even
then, there were cases which made it work terribly (for example, changing one line in hundreds of
huge DAG files effectively requires downloading all of those files from scratch, which might take
minutes and cause all kinds of problems where the files are in an inconsistent state). I think we
need to make people aware that choosing the right filesystem matters and that it has a huge impact
on perceived performance in a number of cases.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
