edwardwang888 commented on a change in pull request #18356:
URL: https://github.com/apache/airflow/pull/18356#discussion_r711870134
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide
an "optimal" experience:
Microsoft SQLServer has not been tested with HA.
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its
performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler
performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG``
form in the database
+* continuously finds and schedules for execution the next tasks to run and
sends those tasks for
+ execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly
independent from each other and
+they are run using different processes. You can fine tune the behaviour of
both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+ * what kind of filesystem you have to share the DAGS
+ * how fast the filesystem is (in many cases of distributed cloud
filesystem you can pay extra to get
+ more throughput/faster filesystem
+ * how much memory you have for your processing
+ * how much CPU you have available
+ * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+ * how many DAG files you have
+ * how many DAGs you have in your files
+ * how large the DAG files are (remember scheduler needs to read and parse
the file every n seconds)
+ * how complex they are
+ * whether parsing your DAGs involves heavy processing (Hint! It should
not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+ * How many schedulers you have
+ * How many parsing processes you have in your scheduler
+ * How much time scheduler waits between re-parsing of the same DAG (it
happens continuously)
+ * How many task instances scheduler processes in one loop
+ * How many new dag runs should be created/scheduled per loop
+ * Whether to execute "mini-scheduler" after completed task to speed up
scheduling dependent tasks
+ * How often the scheduler should perform cleanup and check for orphaned
tasks/adopting them
+ * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but
it's a separate task,
+depending on your particular deployment, your DAG structure, hardware
availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when
managing the
+deployment is to decide what you are going to optimize for. Some users are ok
with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where
some other users
Review comment:
```suggestion
30 seconds delays of new DAG parsing, at the expense of lower CPU usage, whereas some other users
```
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide
an "optimal" experience:
Microsoft SQLServer has not been tested with HA.
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its
performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler
performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG``
form in the database
+* continuously finds and schedules for execution the next tasks to run and
sends those tasks for
+ execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly
independent from each other and
+they are run using different processes. You can fine tune the behaviour of
both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+ * what kind of filesystem you have to share the DAGS
Review comment:
```suggestion
* what kind of filesystem you have to share the DAGs
```
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide
an "optimal" experience:
Microsoft SQLServer has not been tested with HA.
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its
performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler
performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG``
form in the database
+* continuously finds and schedules for execution the next tasks to run and
sends those tasks for
+ execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly
independent from each other and
+they are run using different processes. You can fine tune the behaviour of
both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+ * what kind of filesystem you have to share the DAGS
+ * how fast the filesystem is (in many cases of distributed cloud
filesystem you can pay extra to get
+ more throughput/faster filesystem
+ * how much memory you have for your processing
+ * how much CPU you have available
+ * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+ * how many DAG files you have
+ * how many DAGs you have in your files
+ * how large the DAG files are (remember scheduler needs to read and parse
the file every n seconds)
+ * how complex they are
+ * whether parsing your DAGs involves heavy processing (Hint! It should
not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+ * How many schedulers you have
+ * How many parsing processes you have in your scheduler
+ * How much time scheduler waits between re-parsing of the same DAG (it
happens continuously)
+ * How many task instances scheduler processes in one loop
+ * How many new dag runs should be created/scheduled per loop
+ * Whether to execute "mini-scheduler" after completed task to speed up
scheduling dependent tasks
+ * How often the scheduler should perform cleanup and check for orphaned
tasks/adopting them
+ * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but
it's a separate task,
+depending on your particular deployment, your DAG structure, hardware
availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when
managing the
+deployment is to decide what you are going to optimize for. Some users are ok
with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where
some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs
folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what
aspect of performance is
+most important for you and decide which knobs you want to turn in which
direction.
+
+Generally for fine-tuning, your approach should be the same as for any
performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools
that you usually use
+to observe and monitor your systems):
+
+* decide which aspect of performance is most important for you (what you want
to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are
the usual limiting factors
+* based on your expectations and observations - decide what is your next
improvement and go back to
+ the observation of your performance, bottlenecks. Performance improvement is
an iterative process
+
+The improvements that you can consider are:
+
+* improve utilization of your resources. This is when you have a free capacity
in your system that
+ seems underutilized (again CPU, memory I/O, networking are the prime
candidates) - you can take
+ actions like increasing number of schedulers, parsing processes or
decreasing intervals for more
+ frequent actions might bring improvements in performance at the expense of
higher utilization of those.
+* increase hardware capacity (for example if you see that CPU is limiting you
or tha I/O you use for
+ DAG filesystem is at its limits). Often the problem with scheduler
performance is
+ simply because your system is not "capable" enough and this might be the
only way. For example if
+ you see that you are using all CPU you have on machine, you might want to
add another scheduler on
+ a new machine - in most cases, when you add 2nd or 3rd scheduler, the
capacity of scheduling grows
+ linearly (unless the shared database or filesystem is a bottleneck).
+* experiment with different values for the "scheduler tunables". Often you
might get better effects by
+ simply exchanging one performance aspect for another. For example if you
want to decrease the
+ cpu usage, you might increase file processing interval (but the result will
be that new DAGs will
Review comment:
```suggestion
CPU usage, you might increase file processing interval (but the result will be that new DAGs will
```
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -187,3 +279,36 @@ The following config settings can be used to control
aspects of the Scheduler HA
:ref:`config:scheduler__scheduler_health_check_threshold`) any running or
queued tasks that were launched by the dead process will be "adopted" and
monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+ How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+ The scheduler will list and sort the dag files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+ The batch size of queries in the scheduling main loop. If this is too high,
SQL query
+ performance may be impacted by one or more of the following:
+
+ - reversion to full table scan - complexity of query predicate
+ - excessive locking
+
+ Additionally, you may hit the maximum allowable query length for your db.
+ Set this to 0 for no limit (not advised)
+
+- :ref:`config:scheduler__min_file_process_interval`
+ Number of seconds after which a DAG file is parsed. The DAG file is parsed
every
+ min_file_process_interval number of seconds. Updates to DAGs are reflected
after
+ this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+ The scheduler can run multiple processes in parallel to parse dags. This
defines
Review comment:
```suggestion
The scheduler can run multiple processes in parallel to parse DAGs. This defines
```
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide
an "optimal" experience:
Microsoft SQLServer has not been tested with HA.
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its
performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler
performs two
Review comment:
```suggestion
fine-tune Scheduler behaviour. First of all you need to remember that Scheduler performs two
```
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide
an "optimal" experience:
Microsoft SQLServer has not been tested with HA.
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its
performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler
performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG``
form in the database
+* continuously finds and schedules for execution the next tasks to run and
sends those tasks for
+ execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly
independent from each other and
+they are run using different processes. You can fine tune the behaviour of
both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+ * what kind of filesystem you have to share the DAGS
+ * how fast the filesystem is (in many cases of distributed cloud
filesystem you can pay extra to get
+ more throughput/faster filesystem
+ * how much memory you have for your processing
+ * how much CPU you have available
+ * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+ * how many DAG files you have
+ * how many DAGs you have in your files
+ * how large the DAG files are (remember scheduler needs to read and parse
the file every n seconds)
+ * how complex they are
+ * whether parsing your DAGs involves heavy processing (Hint! It should
not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+ * How many schedulers you have
+ * How many parsing processes you have in your scheduler
+ * How much time scheduler waits between re-parsing of the same DAG (it
happens continuously)
+ * How many task instances scheduler processes in one loop
+ * How many new dag runs should be created/scheduled per loop
Review comment:
```suggestion
* How many new DAG runs should be created/scheduled per loop
```
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide
an "optimal" experience:
Microsoft SQLServer has not been tested with HA.
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its
performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler
performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG``
form in the database
+* continuously finds and schedules for execution the next tasks to run and
sends those tasks for
+ execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly
independent from each other and
+they are run using different processes. You can fine tune the behaviour of
both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+ * what kind of filesystem you have to share the DAGS
+ * how fast the filesystem is (in many cases of distributed cloud
filesystem you can pay extra to get
+ more throughput/faster filesystem
+ * how much memory you have for your processing
+ * how much CPU you have available
+ * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+ * how many DAG files you have
+ * how many DAGs you have in your files
+ * how large the DAG files are (remember scheduler needs to read and parse
the file every n seconds)
+ * how complex they are
+ * whether parsing your DAGs involves heavy processing (Hint! It should
not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+ * How many schedulers you have
+ * How many parsing processes you have in your scheduler
+ * How much time scheduler waits between re-parsing of the same DAG (it
happens continuously)
+ * How many task instances scheduler processes in one loop
+ * How many new dag runs should be created/scheduled per loop
+ * Whether to execute "mini-scheduler" after completed task to speed up
scheduling dependent tasks
+ * How often the scheduler should perform cleanup and check for orphaned
tasks/adopting them
+ * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but
it's a separate task,
+depending on your particular deployment, your DAG structure, hardware
availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when
managing the
+deployment is to decide what you are going to optimize for. Some users are ok
with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where
some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs
folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what
aspect of performance is
+most important for you and decide which knobs you want to turn in which
direction.
+
+Generally for fine-tuning, your approach should be the same as for any
performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools
that you usually use
+to observe and monitor your systems):
+
+* decide which aspect of performance is most important for you (what you want
to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are
the usual limiting factors
+* based on your expectations and observations - decide what is your next
improvement and go back to
+ the observation of your performance, bottlenecks. Performance improvement is
an iterative process
Review comment:
```suggestion
the observation of your performance, bottlenecks. Performance improvement is an iterative process.
```
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -138,12 +141,101 @@ The following databases are fully supported and provide
an "optimal" experience:
Microsoft SQLServer has not been tested with HA.
+
+Fine-tuning your Scheduler
+--------------------------
+
+When you deploy Airflow in production you often would like to optimize its
performance and
+fine-tune Scheduler behaviour. Firs of all you need to remember that Scheduler
performs two
+operations:
+
+* continuously parses DAG files and updates their starts in ``Serialized DAG``
form in the database
+* continuously finds and schedules for execution the next tasks to run and
sends those tasks for
+ execution to the executor you have configured
+
+Those two tasks are executed in parallel by scheduler, they are fairly
independent from each other and
+they are run using different processes. You can fine tune the behaviour of
both components, however
+in order to fine-tune your scheduler, you need to included a number of factors:
+
+* The kind of deployment you have
+ * what kind of filesystem you have to share the DAGS
+ * how fast the filesystem is (in many cases of distributed cloud
filesystem you can pay extra to get
+ more throughput/faster filesystem
+ * how much memory you have for your processing
+ * how much CPU you have available
+ * how much networking throughput you have available
+
+* The logic and definition of your DAG structure:
+ * how many DAG files you have
+ * how many DAGs you have in your files
+ * how large the DAG files are (remember scheduler needs to read and parse
the file every n seconds)
+ * how complex they are
+ * whether parsing your DAGs involves heavy processing (Hint! It should
not. See:doc:`/best-practices`)
+
+* The scheduler configuration
+ * How many schedulers you have
+ * How many parsing processes you have in your scheduler
+ * How much time scheduler waits between re-parsing of the same DAG (it
happens continuously)
+ * How many task instances scheduler processes in one loop
+ * How many new dag runs should be created/scheduled per loop
+ * Whether to execute "mini-scheduler" after completed task to speed up
scheduling dependent tasks
+ * How often the scheduler should perform cleanup and check for orphaned
tasks/adopting them
+ * Whether scheduler uses row-level locking
+
+
+Airflow gives you a lot of "knobs" to turn to fine tune the performance but
it's a separate task,
+depending on your particular deployment, your DAG structure, hardware
availability and expectations,
+to decide which knobs to turn to get best effect for you. Part of the job when
managing the
+deployment is to decide what you are going to optimize for. Some users are ok
with
+30 seconds delays of new DAG parsing, at the expense of lower CPU usage, where
some other users
+expect the DAGs to be parsed almost instantly when they appear in the DAGs
folder at the
+expense of higher CPU usage for example.
+
+Airflow gives you the flexibility to decide, but you should find out what
aspect of performance is
+most important for you and decide which knobs you want to turn in which
direction.
+
+Generally for fine-tuning, your approach should be the same as for any
performance improvement and
+optimizations (we will not recommend any specific tools - just use the tools
that you usually use
+to observe and monitor your systems):
+
+* decide which aspect of performance is most important for you (what you want
to improve)
+* observe your system to see where your bottlenecks are: CPU, memory, I/O are
the usual limiting factors
+* based on your expectations and observations - decide what is your next
improvement and go back to
+ the observation of your performance, bottlenecks. Performance improvement is
an iterative process
+
+The improvements that you can consider are:
+
+* improve utilization of your resources. This is when you have a free capacity
in your system that
+ seems underutilized (again CPU, memory I/O, networking are the prime
candidates) - you can take
+ actions like increasing number of schedulers, parsing processes or
decreasing intervals for more
+ frequent actions might bring improvements in performance at the expense of
higher utilization of those.
+* increase hardware capacity (for example if you see that CPU is limiting you
or tha I/O you use for
Review comment:
```suggestion
* increase hardware capacity (for example if you see that CPU is limiting you or that I/O you use for
```
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -187,3 +279,36 @@ The following config settings can be used to control
aspects of the Scheduler HA
:ref:`config:scheduler__scheduler_health_check_threshold`) any running or
queued tasks that were launched by the dead process will be "adopted" and
monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+ How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+ The scheduler will list and sort the dag files to decide the parsing order.
Review comment:
```suggestion
The scheduler will list and sort the DAG files to decide the parsing order.
```
##########
File path: docs/apache-airflow/concepts/scheduler.rst
##########
@@ -187,3 +279,36 @@ The following config settings can be used to control
aspects of the Scheduler HA
:ref:`config:scheduler__scheduler_health_check_threshold`) any running or
queued tasks that were launched by the dead process will be "adopted" and
monitored by this scheduler instead.
+
+- :ref:`config:scheduler__dag_dir_list_interval`
+ How often (in seconds) to scan the DAGs directory for new files.
+
+- :ref:`config:scheduler__file_parsing_sort_mode`
+ The scheduler will list and sort the dag files to decide the parsing order.
+
+- :ref:`config:scheduler__max_tis_per_query`
+ The batch size of queries in the scheduling main loop. If this is too high,
SQL query
+ performance may be impacted by one or more of the following:
+
+ - reversion to full table scan - complexity of query predicate
+ - excessive locking
+
+ Additionally, you may hit the maximum allowable query length for your db.
+ Set this to 0 for no limit (not advised)
+
+- :ref:`config:scheduler__min_file_process_interval`
+ Number of seconds after which a DAG file is parsed. The DAG file is parsed
every
+ min_file_process_interval number of seconds. Updates to DAGs are reflected
after
+ this interval. Keeping this number low will increase CPU usage.
+
+- :ref:`config:scheduler__parsing_processes`
+ The scheduler can run multiple processes in parallel to parse dags. This
defines
+ how many processes will run.
+
+- :ref:`config:scheduler__processor_poll_interval`
+ The number of seconds to wait between consecutive DAG file processing
+
+- :ref:`config:scheduler__schedule_after_task_execution`
+ Should the Task supervisor process perform a “mini scheduler” to attempt to
schedule more tasks of
+ the same DAG. Leaving this on will mean tasks in the same DAG execute
quicker,
+ but might starve out other dags in some circumstances
Review comment:
```suggestion
but might starve out other DAGs in some circumstances
```
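Side note, not part of the suggestion: for anyone wanting to try the tunables listed in this hunk, here is a minimal sketch of how a `[scheduler]` fragment could be turned into environment-variable overrides. It relies on Airflow's `AIRFLOW__{SECTION}__{KEY}` naming convention for overriding configuration. The values are illustrative only and should be checked against the configuration reference for your Airflow version; `SCHEDULER_CFG` and `to_env_overrides` are hypothetical names made up for this example.

```python
import configparser

# Illustrative values only (an assumption, not a recommendation);
# check the configuration reference for your Airflow version.
SCHEDULER_CFG = """
[scheduler]
dag_dir_list_interval = 300
min_file_process_interval = 30
parsing_processes = 2
max_tis_per_query = 512
schedule_after_task_execution = True
"""


def to_env_overrides(cfg_text):
    """Map each option to its AIRFLOW__{SECTION}__{KEY} environment variable.

    Hypothetical helper for this example; it is not part of Airflow.
    """
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return {
        f"AIRFLOW__{section.upper()}__{key.upper()}": value
        for section in parser.sections()
        for key, value in parser.items(section)
    }


overrides = to_env_overrides(SCHEDULER_CFG)
for name, value in sorted(overrides.items()):
    # Print shell-style export lines that could be pasted into a deployment script.
    print(f"export {name}={value}")
```

Raising `min_file_process_interval` in such a fragment would trade slower pickup of DAG changes for lower CPU usage, which is the trade-off the quoted text describes.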
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.