BasPH commented on code in PR #36513:
URL: https://github.com/apache/airflow/pull/36513#discussion_r1439322639
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
Review Comment:
```suggestion
------------------
Airflow's architecture consists of multiple components. The following
sections describe each component's function and whether they're required for a
bare-minimum Airflow installation, or an optional component to achieve better
Airflow extensibility, performance, and scalability.
```
It's a best practice not to stack headings. Let's write at least a short
descriptive text explaining what's coming.
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
Review Comment:
```suggestion
A DAG specifies the dependencies between tasks, which defines the order in
which to execute the tasks. Tasks describe what to do, be it fetching data,
running analysis, triggering other systems, or more.
```
Small rewording for readability. Retries can also be defined at the task
level, and I don't think retries are critical to the Airflow architecture, so
I'd leave them out here.
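
To illustrate the point about retries being a task-level setting rather than an
architectural one, here is a minimal sketch (the DAG id, task id and command
are made up for illustration):

```python
# Hypothetical example: retries are configured on the task itself.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="retry_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
) as dag:
    BashOperator(
        task_id="flaky_step",
        bash_command="exit 0",
        retries=3,                         # task-level retry count
        retry_delay=timedelta(minutes=5),  # wait between attempts
    )
```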
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
Review Comment:
```suggestion
Some Airflow components are optional and can enable better extensibility,
scalability, and performance in your Airflow:
```
Would explain _why_ they're optional here.
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
Review Comment:
IMO it's confusing that this is mentioned in both the required and optional
components. I would leave it only in the required components, because without a
DAG folder you have no functioning Airflow.
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
+
+All the components are Python applications that can be deployed using various
deployment mechanisms.
+
+They deployed Python applications can have extra *installed packages*
installed in their Python environment.
Review Comment:
By "deployed Python applications" you're referring to Airflow components,
right?
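
As background on the *plugins* folder described in the hunk above: a plugin is
just a Python file placed in that folder that defines an ``AirflowPlugin``
subclass. A minimal, purely illustrative sketch (the names are made up, not
from the PR):

```python
# plugins/my_company_plugin.py -- minimal illustrative plugin sketch.
# Files like this are read by the scheduler, dag processor, triggerer and
# webserver, as the proposed docs text explains.
from airflow.plugins_manager import AirflowPlugin


def days_to_seconds(days: int) -> int:
    """Tiny example macro made available to task templates."""
    return days * 24 * 60 * 60


class MyCompanyPlugin(AirflowPlugin):
    name = "my_company_plugin"
    macros = [days_to_seconds]
```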
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
+
+All the components are Python applications that can be deployed using various
deployment mechanisms.
+
+They deployed Python applications can have extra *installed packages*
installed in their Python environment.
+This is useful for example to install custom operators or sensors or extend
Airflow functionality with custom
+plugins.
+
+While Airflow can be run in a single machine and with simple installation
where only *scheduler* and
+*webserver* are deployed, Airflow is designed to be scalable and secure, and
is able ot run in a distributed
+environment - where various components can run on different machines, with
different security perimeters
+and can be scaled by running multiple instances of the components above. Also
while single person can run and
+manage Airflow installation, Airflow Deployment in more complex setup can
involve various roles of users
+as described in the :doc:`/security/security_model`.
+
+Airflow itself is agnostic to what you're running - it will happily
orchestrate and run anything,
+either with high-level support from one of our providers, or directly as a
command using the shell
+or Python :doc:`operators`.
+
+Architecture Diagrams
+---------------------
+
+The diagrams below show different ways to deploy Airflow - gradually from the
simple "one machine" and
+single person deployment, to a more complex deployment with separate
components, separate user roles and
+finally with more isolated security perimeters.
+
+The meaning of the different connection types in the diagrams below is as
follows:
+
+* **brown solid lines** represent *DAG files* submission and synchronization
+* **blue solid lines** represent deploying and accessing *installed packages*
and *plugins*
+* **black dashed lines** represent control flow of workers by the *scheduler*
(via executor)
+* **black solid lines** represent accessing the UI to manage execution of the
workflows
+* **red dashed lines** represent accessing the *metadata database* by all
components
+
+Basic Airflow deployment
+........................
+
+This is the simplest deployment of Airflow, usually operated and managed on a
single
+machine. Such deployment usually uses Local Executor, where the *scheduler*
and the *workers* are in
+the same component and the *DAG files* are read directly from the local
filesystem by the *scheduler*
+and there is no separate *triggerer* to run deferred tasks and *webserver*
runs on the same machine
+as the *scheduler*.
+
+Such installation typically does not separate user roles - deployment,
configuration, operation, authoring
+and maintenance are all done by the same person and there are no security
perimeters between the components.
.. image:: ../img/diagram_basic_airflow_architecture.png
-Most executors will generally also introduce other components to let them talk
to their workers - like a task queue - but you can still think of the executor
and its workers as a single logical component in Airflow overall, handling the
actual task execution.
+If you want to run Airflow on a single machine in a simple single-machine
setup, you can skip the
+more complex diagrams below and go straight to the :ref:`overview:workloads`
section.
+
+Distributed Airflow architecture
+................................
+
+This is the architecture of Airflow where components of Airflow are
distributed among multiple machines
+and where various roles of users are introduced - **Deployment Manager**,
**DAG author**,
+**Operations User**. You can read more about those various roles in the
:doc:`/security/security_model`.
+
+In case of distributed deployment, it is much more important to consider
security aspects of the components.
Review Comment:
```suggestion
In the case of a distributed deployment, it is important to consider the
security aspects of the components.
```
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
Review Comment:
Let's make this header a different style because it's now on the same level
as "Required components" and "Optional components".
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
Review Comment:
```suggestion
* Optional *dag processor*, which parses DAG files and serializes them into
the
*metadata database*. By default, the *dag processor* process is part of
the scheduler, but it can be run as a separate component for scalability
reasons.
```
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
Review Comment:
```suggestion
* A folder of *DAG files* is read by the *scheduler* to figure out what
tasks to run and when and to
```
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
Review Comment:
```suggestion
where deferred tasks are not used, a triggerer is not necessary. More
about deferring tasks can be
```
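
For context on when a *triggerer* is actually needed: only DAGs that defer work
require one. A minimal sketch using a stock async sensor (assuming
``TimeDeltaSensorAsync`` from ``airflow.sensors.time_delta``; the DAG and task
ids are made up):

```python
# Illustrative only: this task defers into the triggerer's asyncio event
# loop instead of occupying a worker slot while it waits.
from datetime import datetime, timedelta

from airflow import DAG
from airflow.sensors.time_delta import TimeDeltaSensorAsync

with DAG(
    dag_id="deferred_example",
    start_date=datetime(2023, 1, 1),
    schedule=None,
) as dag:
    TimeDeltaSensorAsync(task_id="wait_an_hour", delta=timedelta(hours=1))
```

An installation with no such deferrable tasks can run without a triggerer,
which matches the suggested wording above.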
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
+
+All the components are Python applications that can be deployed using various
deployment mechanisms.
+
+They deployed Python applications can have extra *installed packages*
installed in their Python environment.
+This is useful for example to install custom operators or sensors or extend
Airflow functionality with custom
+plugins.
+
+While Airflow can be run in a single machine and with simple installation
where only *scheduler* and
+*webserver* are deployed, Airflow is designed to be scalable and secure, and
is able ot run in a distributed
+environment - where various components can run on different machines, with
different security perimeters
+and can be scaled by running multiple instances of the components above. Also
while single person can run and
+manage Airflow installation, Airflow Deployment in more complex setup can
involve various roles of users
+as described in the :doc:`/security/security_model`.
Review Comment:
I'm confused about the key message of this paragraph. What is the key
message here? "Airflow is scalable"? "Airflow supports roles"?
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
+
+All the components are Python applications that can be deployed using various
deployment mechanisms.
+
+They deployed Python applications can have extra *installed packages*
installed in their Python environment.
+This is useful for example to install custom operators or sensors or extend
Airflow functionality with custom
+plugins.
+
+While Airflow can be run in a single machine and with simple installation
where only *scheduler* and
+*webserver* are deployed, Airflow is designed to be scalable and secure, and
is able ot run in a distributed
+environment - where various components can run on different machines, with
different security perimeters
+and can be scaled by running multiple instances of the components above. Also
while single person can run and
+manage Airflow installation, Airflow Deployment in more complex setup can
involve various roles of users
+as described in the :doc:`/security/security_model`.
+
+Airflow itself is agnostic to what you're running - it will happily
orchestrate and run anything,
+either with high-level support from one of our providers, or directly as a
command using the shell
+or Python :doc:`operators`.
+
+Architecture Diagrams
+---------------------
+
+The diagrams below show different ways to deploy Airflow - gradually from the
simple "one machine" and
+single person deployment, to a more complex deployment with separate
components, separate user roles and
+finally with more isolated security perimeters.
+
+The meaning of the different connection types in the diagrams below is as
follows:
+
+* **brown solid lines** represent *DAG files* submission and synchronization
+* **blue solid lines** represent deploying and accessing *installed packages*
and *plugins*
+* **black dashed lines** represent control flow of workers by the *scheduler*
(via executor)
+* **black solid lines** represent accessing the UI to manage execution of the
workflows
+* **red dashed lines** represent accessing the *metadata database* by all
components
+
+Basic Airflow deployment
+........................
+
+This is the simplest deployment of Airflow, usually operated and managed on a
single
+machine. Such deployment usually uses Local Executor, where the *scheduler*
and the *workers* are in
+the same component and the *DAG files* are read directly from the local
filesystem by the *scheduler*
+and there is no separate *triggerer* to run deferred tasks and *webserver*
runs on the same machine
+as the *scheduler*.
+
+Such installation typically does not separate user roles - deployment,
configuration, operation, authoring
+and maintenance are all done by the same person and there are no security
perimeters between the components.
.. image:: ../img/diagram_basic_airflow_architecture.png
-Most executors will generally also introduce other components to let them talk
to their workers - like a task queue - but you can still think of the executor
and its workers as a single logical component in Airflow overall, handling the
actual task execution.
+If you want to run Airflow on a single machine in a simple single-machine
setup, you can skip the
+more complex diagrams below and go straight to the :ref:`overview:workloads`
section.
+
+Distributed Airflow architecture
+................................
+
+This is the architecture of Airflow where components of Airflow are
distributed among multiple machines
+and where various roles of users are introduced - **Deployment Manager**,
**DAG author**,
+**Operations User**. You can read more about those various roles in the
:doc:`/security/security_model`.
+
+In case of distributed deployment, it is much more important to consider
security aspects of the components.
+The *webserver* does not have access to the *DAG files* directly (the code you
see in the Code tab of the
+UI is synchronized via the *metadata database*). This way the *webserver*
cannot execute any
+code submitted by **DAG author**. It can only execute code that is installed
as *installed packages* or
+*plugins* by the **Deployment Manager**. Also the **Operations User** has only
access to the UI and can
+only trigger DAGs and tasks, but cannot author DAGs.
+
+The *DAG files* need to be synchronized between all the components that use
them - *scheduler*,
+*triggerer* and *workers*. The *DAG files* can be synchronized by various
mechanisms - for example by
+using a shared filesystem, or by using a version control system like Git.
Review Comment:
> or by using a version control system like Git
I think that doesn't cover the full message - the user needs to set up Git
sync for that to work.
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
Review Comment:
```suggestion
* A *metadata database*, used by the *scheduler* and *webserver*, to store
state of workflows and tasks.
```
Shouldn't we just say "all Airflow components"?
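For context on what "store state" means in practice, here is a minimal sketch (assuming an Airflow 2.x installation and a hypothetical `example_dag` DAG id) of how a component reads workflow and task state back from the metadata database through the ORM:
```python
from airflow.models import DagRun, TaskInstance
from airflow.utils.session import create_session

# Every component talks to the same metadata database via SQLAlchemy,
# so state written by the scheduler is visible to the webserver and others.
with create_session() as session:
    for run in session.query(DagRun).filter(DagRun.dag_id == "example_dag"):
        print(run.run_id, run.state)

    # Task-level state is stored per task instance in the same database.
    for ti in session.query(TaskInstance).filter(TaskInstance.dag_id == "example_dag"):
        print(ti.task_id, ti.state)
```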
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
Review Comment:
```suggestion
* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality (similar to installed packages).
Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
```
I would describe the key message here (what a plugin is), and leave out the
deployment details.
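To make that key message concrete: a plugin is just a Python module placed in the plugins folder that subclasses `AirflowPlugin`. A minimal sketch (the names `my_company_plugin` and `my_macro` are hypothetical):
```python
# plugins/my_company_plugin.py - loaded by the scheduler, dag processor,
# triggerer and webserver at start-up.
from airflow.plugins_manager import AirflowPlugin


def my_macro() -> str:
    # A trivial helper exposed to Jinja templates via the plugin's macros list.
    return "hello from a plugin"


class MyCompanyPlugin(AirflowPlugin):
    # The name under which Airflow registers this plugin.
    name = "my_company_plugin"
    # Extend Airflow with an extra template macro.
    macros = [my_macro]
```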
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
+
+All the components are Python applications that can be deployed using various
deployment mechanisms.
+
+They deployed Python applications can have extra *installed packages*
installed in their Python environment.
+This is useful for example to install custom operators or sensors or extend
Airflow functionality with custom
+plugins.
+
+While Airflow can be run in a single machine and with simple installation
where only *scheduler* and
+*webserver* are deployed, Airflow is designed to be scalable and secure, and
is able ot run in a distributed
+environment - where various components can run on different machines, with
different security perimeters
+and can be scaled by running multiple instances of the components above. Also
while single person can run and
+manage Airflow installation, Airflow Deployment in more complex setup can
involve various roles of users
+as described in the :doc:`/security/security_model`.
+
+Airflow itself is agnostic to what you're running - it will happily
orchestrate and run anything,
+either with high-level support from one of our providers, or directly as a
command using the shell
+or Python :doc:`operators`.
+
+Architecture Diagrams
+---------------------
+
+The diagrams below show different ways to deploy Airflow - gradually from the
simple "one machine" and
+single person deployment, to a more complex deployment with separate
components, separate user roles and
+finally with more isolated security perimeters.
+
+The meaning of the different connection types in the diagrams below is as
follows:
+
+* **brown solid lines** represent *DAG files* submission and synchronization
+* **blue solid lines** represent deploying and accessing *installed packages*
and *plugins*
+* **black dashed lines** represent control flow of workers by the *scheduler*
(via executor)
+* **black solid lines** represent accessing the UI to manage execution of the
workflows
+* **red dashed lines** represent accessing the *metadata database* by all
components
+
+Basic Airflow deployment
+........................
+
+This is the simplest deployment of Airflow, usually operated and managed on a
single
+machine. Such deployment usually uses Local Executor, where the *scheduler*
and the *workers* are in
Review Comment:
Could you add a link to the local executor here?
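Agreed. Since the executor is a configuration property rather than a separately deployed component, a quick way to check which executor an installation uses (a minimal sketch, assuming Airflow 2.x and access to the configuration on that machine) is:
```python
from airflow.configuration import conf

# The executor is configured on the scheduler, e.g. "LocalExecutor",
# "CeleryExecutor" or "KubernetesExecutor".
print(conf.get("core", "executor"))
```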
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
Review Comment:
```suggestion
A minimal Airflow installation consists of the following components:
```
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
Review Comment:
```suggestion
* Optional *triggerer*, which executes deferred tasks in an asyncio event
loop. In a basic installation
```
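For readers wondering what a deferred task looks like, here is a minimal sketch of an operator that hands its wait over to the *triggerer* (assuming Airflow 2.x; `WaitABitOperator` is a hypothetical example name):
```python
from datetime import timedelta

from airflow.models.baseoperator import BaseOperator
from airflow.triggers.temporal import TimeDeltaTrigger


class WaitABitOperator(BaseOperator):
    """Waits 10 minutes without occupying a worker slot."""

    def execute(self, context):
        # Defer to the triggerer's asyncio event loop until the trigger fires.
        self.defer(
            trigger=TimeDeltaTrigger(timedelta(minutes=10)),
            method_name="execute_complete",
        )

    def execute_complete(self, context, event=None):
        # Resumed on a worker once the triggerer reports the trigger fired.
        self.log.info("Done waiting, resuming work")
```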
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
+
+All the components are Python applications that can be deployed using various
deployment mechanisms.
+
+They deployed Python applications can have extra *installed packages*
installed in their Python environment.
+This is useful for example to install custom operators or sensors or extend
Airflow functionality with custom
+plugins.
+
+While Airflow can be run in a single machine and with simple installation
where only *scheduler* and
+*webserver* are deployed, Airflow is designed to be scalable and secure, and
is able ot run in a distributed
+environment - where various components can run on different machines, with
different security perimeters
+and can be scaled by running multiple instances of the components above. Also
while single person can run and
+manage Airflow installation, Airflow Deployment in more complex setup can
involve various roles of users
+as described in the :doc:`/security/security_model`.
+
+Airflow itself is agnostic to what you're running - it will happily
orchestrate and run anything,
+either with high-level support from one of our providers, or directly as a
command using the shell
+or Python :doc:`operators`.
+
+Architecture Diagrams
+---------------------
+
+The diagrams below show different ways to deploy Airflow - gradually from the
simple "one machine" and
+single person deployment, to a more complex deployment with separate
components, separate user roles and
+finally with more isolated security perimeters.
+
+The meaning of the different connection types in the diagrams below is as
follows:
+
+* **brown solid lines** represent *DAG files* submission and synchronization
+* **blue solid lines** represent deploying and accessing *installed packages*
and *plugins*
+* **black dashed lines** represent control flow of workers by the *scheduler*
(via executor)
+* **black solid lines** represent accessing the UI to manage execution of the
workflows
+* **red dashed lines** represent accessing the *metadata database* by all
components
+
+Basic Airflow deployment
+........................
+
+This is the simplest deployment of Airflow, usually operated and managed on a
single
+machine. Such deployment usually uses Local Executor, where the *scheduler*
and the *workers* are in
+the same component and the *DAG files* are read directly from the local
filesystem by the *scheduler*
+and there is no separate *triggerer* to run deferred tasks and *webserver*
runs on the same machine
+as the *scheduler*.
+
+Such installation typically does not separate user roles - deployment,
configuration, operation, authoring
+and maintenance are all done by the same person and there are no security
perimeters between the components.
.. image:: ../img/diagram_basic_airflow_architecture.png
-Most executors will generally also introduce other components to let them talk
to their workers - like a task queue - but you can still think of the executor
and its workers as a single logical component in Airflow overall, handling the
actual task execution.
+If you want to run Airflow on a single machine in a simple single-machine
setup, you can skip the
+more complex diagrams below and go straight to the :ref:`overview:workloads`
section.
+
+Distributed Airflow architecture
+................................
+
+This is the architecture of Airflow where components of Airflow are
distributed among multiple machines
+and where various roles of users are introduced - **Deployment Manager**,
**DAG author**,
+**Operations User**. You can read more about those various roles in the
:doc:`/security/security_model`.
+
+In case of distributed deployment, it is much more important to consider
security aspects of the components.
+The *webserver* does not have access to the *DAG files* directly (the code you
see in the Code tab of the
+UI is synchronized via the *metadata database*). This way the *webserver*
cannot execute any
+code submitted by **DAG author**. It can only execute code that is installed
as *installed packages* or
+*plugins* by the **Deployment Manager**. Also the **Operations User** has only
access to the UI and can
+only trigger DAGs and tasks, but cannot author DAGs.
Review Comment:
```suggestion
The *webserver* does not have access to the *DAG files* directly. The code
in the Code tab of the
UI is read from the *metadata database*. The *webserver* cannot execute any
code submitted by **DAG author**. It can only execute code that is installed as
*installed packages* or *plugins* by the **Deployment Manager**. The
**Operations User** only has access to the UI and can only trigger DAGs and
tasks, but cannot author DAGs.
```
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
+
+All the components are Python applications that can be deployed using various
deployment mechanisms.
+
+They deployed Python applications can have extra *installed packages*
installed in their Python environment.
+This is useful for example to install custom operators or sensors or extend
Airflow functionality with custom
+plugins.
+
+While Airflow can be run in a single machine and with simple installation
where only *scheduler* and
+*webserver* are deployed, Airflow is designed to be scalable and secure, and
is able ot run in a distributed
+environment - where various components can run on different machines, with
different security perimeters
+and can be scaled by running multiple instances of the components above. Also
while single person can run and
+manage Airflow installation, Airflow Deployment in more complex setup can
involve various roles of users
+as described in the :doc:`/security/security_model`.
+
+Airflow itself is agnostic to what you're running - it will happily
orchestrate and run anything,
+either with high-level support from one of our providers, or directly as a
command using the shell
+or Python :doc:`operators`.
+
+Architecture Diagrams
+---------------------
+
+The diagrams below show different ways to deploy Airflow - gradually from the
simple "one machine" and
+single person deployment, to a more complex deployment with separate
components, separate user roles and
+finally with more isolated security perimeters.
+
+The meaning of the different connection types in the diagrams below is as
follows:
+
+* **brown solid lines** represent *DAG files* submission and synchronization
+* **blue solid lines** represent deploying and accessing *installed packages*
and *plugins*
+* **black dashed lines** represent control flow of workers by the *scheduler*
(via executor)
+* **black solid lines** represent accessing the UI to manage execution of the
workflows
+* **red dashed lines** represent accessing the *metadata database* by all
components
+
+Basic Airflow deployment
+........................
+
+This is the simplest deployment of Airflow, usually operated and managed on a
single
+machine. Such deployment usually uses Local Executor, where the *scheduler*
and the *workers* are in
+the same component and the *DAG files* are read directly from the local
filesystem by the *scheduler*
+and there is no separate *triggerer* to run deferred tasks and *webserver*
runs on the same machine
+as the *scheduler*.
Review Comment:
Let's break this sentence up into multiple sentences. How about:
```suggestion
machine. Such a deployment usually uses the LocalExecutor, where the
*scheduler* and the *workers* are in
the same Python process and the *DAG files* are read directly from the local
filesystem by the *scheduler*.
The *webserver* runs on the same machine as the *scheduler*.
```
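It might also help to show what such a single-machine setup actually runs: the scheduler simply picks up ordinary DAG files from the local DAGs folder. A minimal sketch of such a file (hypothetical ids, assuming Airflow 2.4+ for the `schedule` argument):
```python
# dags/hello.py - read directly from the local filesystem by the scheduler.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator
from airflow.operators.python import PythonOperator


def say_hello():
    print("hello from the LocalExecutor")


with DAG(dag_id="hello", start_date=datetime(2024, 1, 1), schedule=None):
    # With the LocalExecutor both tasks run as local subprocesses.
    shell_task = BashOperator(task_id="shell_task", bash_command="echo hello")
    python_task = PythonOperator(task_id="python_task", python_callable=say_hello)

    shell_task >> python_task
```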
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
Architecture Overview
=====================
-Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains
individual pieces of work called :doc:`tasks`, arranged with dependencies and
data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is
represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces
of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.
.. image:: ../img/edge_label_example.png
:alt: An example Airflow DAG, rendered in Graph
-A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries; the Tasks themselves describe what to do, be it
fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to
execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running
analysis, triggering other systems,
+or more.
-An Airflow installation generally consists of the following components:
+Airflow components
+------------------
-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled workflows, and submitting :doc:`tasks` to the
executor to run.
+Required components
+...................
-* An :doc:`executor <executor/index>`, which handles running tasks. In the
default Airflow installation, this runs everything *inside* the scheduler, but
most production-suitable executors actually push task execution out to
*workers*.
+Minimal Airflow installation consists of the following components:
-* A *triggerer*, which executes deferred tasks - executed in an async-io event
loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which
handles both triggering scheduled
+ workflows, and submitting :doc:`tasks` to the executor to run. The
:doc:`executor <executor/index>`, is
+ a configuration property of the *scheduler*, not a separate component and
runs within the scheduler
+ process. There are several executors available out of the box, and you can
also write your own.
-* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and
debug the behaviour of
+ DAGs and tasks.
-* A folder of *DAG files*, read by the scheduler and executor (and any workers
the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks
to run and when and to
+ run them.
-* A *metadata database*, used by the scheduler, executor and webserver to
store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store
state of workflows and tasks.
+ Setting up a metadata database is described in :doc:`/howto/set-up-database`
and is required for
+ Airflow to work.
+Optional components
+...................
-Basic airflow architecture
---------------------------
+There are also some optional components that are not present in the basic
installation
-This is the basic architecture of Airflow that you'll see in simple
installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In
the basic installation
+ worker might be part of the scheduler not a separate component. It can be
run as a long running process
+ in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+ :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event
loop. In basic installation
+ where deferred tasks are not used, triggerer might not be present. More
about deferring tasks can be
+ found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into
the
+ *metadata database* in basic installation *dag processor* might be part of
the scheduler not
+ a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and
*triggerer* when they are running.
+ If *dag processor* is present *scheduler** does not need to read the *DAG
files* directly. More about
+ processing DAG files can be found in
:doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's
functionality and by placing Python
+ files in the plugins folder you can extend Airflow functionality (similarly
as via installed packages).
+ Plugins are read by the *scheduler*, *dag processor*, *triggerer* and
*webserver*. More about
+ plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
+
+All the components are Python applications that can be deployed using various
deployment mechanisms.
+
+They deployed Python applications can have extra *installed packages*
installed in their Python environment.
+This is useful for example to install custom operators or sensors or extend
Airflow functionality with custom
+plugins.
+
+While Airflow can be run in a single machine and with simple installation
where only *scheduler* and
+*webserver* are deployed, Airflow is designed to be scalable and secure, and
is able ot run in a distributed
+environment - where various components can run on different machines, with
different security perimeters
+and can be scaled by running multiple instances of the components above. Also
while single person can run and
+manage Airflow installation, Airflow Deployment in more complex setup can
involve various roles of users
+as described in the :doc:`/security/security_model`.
+
+Airflow itself is agnostic to what you're running - it will happily
orchestrate and run anything,
+either with high-level support from one of our providers, or directly as a
command using the shell
+or Python :doc:`operators`.
+
+Architecture Diagrams
+---------------------
+
+The diagrams below show different ways to deploy Airflow - gradually from the
simple "one machine" and
+single person deployment, to a more complex deployment with separate
components, separate user roles and
+finally with more isolated security perimeters.
+
+The meaning of the different connection types in the diagrams below is as
follows:
+
+* **brown solid lines** represent *DAG files* submission and synchronization
+* **blue solid lines** represent deploying and accessing *installed packages*
and *plugins*
+* **black dashed lines** represent control flow of workers by the *scheduler*
(via executor)
+* **black solid lines** represent accessing the UI to manage execution of the
workflows
+* **red dashed lines** represent accessing the *metadata database* by all
components
+
+Basic Airflow deployment
+........................
+
+This is the simplest deployment of Airflow, usually operated and managed on a
single
+machine. Such deployment usually uses Local Executor, where the *scheduler*
and the *workers* are in
+the same component and the *DAG files* are read directly from the local
filesystem by the *scheduler*
+and there is no separate *triggerer* to run deferred tasks and *webserver*
runs on the same machine
+as the *scheduler*.
+
+Such installation typically does not separate user roles - deployment,
configuration, operation, authoring
Review Comment:
```suggestion
Such an installation typically does not separate user roles - deployment,
configuration, operation, authoring
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]