potiuk commented on code in PR #36513: URL: https://github.com/apache/airflow/pull/36513#discussion_r1440820704
##########
docs/apache-airflow/core-concepts/overview.rst:
##########
@@ -18,49 +18,151 @@
 Architecture Overview
 =====================

-Airflow is a platform that lets you build and run *workflows*. A workflow is represented as a :doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces of work called :doc:`tasks`, arranged with dependencies and data flows taken into account.
+Airflow is a platform that lets you build and run *workflows*. A workflow is represented as a
+:doc:`DAG <dags>` (a Directed Acyclic Graph), and contains individual pieces of work called
+:doc:`tasks`, arranged with dependencies and data flows taken into account.

 .. image:: ../img/edge_label_example.png
   :alt: An example Airflow DAG, rendered in Graph

-A DAG specifies the dependencies between Tasks, and the order in which to execute them and run retries; the Tasks themselves describe what to do, be it fetching data, running analysis, triggering other systems, or more.
+A DAG specifies the dependencies between Tasks, and the order in which to execute them and run retries;
+the Tasks themselves describe what to do, be it fetching data, running analysis, triggering other systems,
+or more.

-An Airflow installation generally consists of the following components:
+Airflow components
+------------------

-* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which handles both triggering scheduled workflows, and submitting :doc:`tasks` to the executor to run.
+Required components
+...................

-* An :doc:`executor <executor/index>`, which handles running tasks. In the default Airflow installation, this runs everything *inside* the scheduler, but most production-suitable executors actually push task execution out to *workers*.
+Minimal Airflow installation consists of the following components:

-* A *triggerer*, which executes deferred tasks - executed in an async-io event loop.
+* A :doc:`scheduler <../administration-and-deployment/scheduler>`, which handles both triggering scheduled
+  workflows, and submitting :doc:`tasks` to the executor to run. The :doc:`executor <executor/index>`, is
+  a configuration property of the *scheduler*, not a separate component and runs within the scheduler
+  process. There are several executors available out of the box, and you can also write your own.

-* A *webserver*, which presents a handy user interface to inspect, trigger and debug the behaviour of DAGs and tasks.
+* A *webserver*, which presents a handy user interface to inspect, trigger and debug the behaviour of
+  DAGs and tasks.

-* A folder of *DAG files*, read by the scheduler and executor (and any workers the executor has)
+* A folder of *DAG files*, is read by the *scheduler* to figure out what tasks to run and when and to
+  run them.

-* A *metadata database*, used by the scheduler, executor and webserver to store state.
+* A *metadata database*, used by the *scheduler*, and *webserver* to store state of workflows and tasks.
+  Setting up a metadata database is described in :doc:`/howto/set-up-database` and is required for
+  Airflow to work.

+Optional components
+...................

-Basic airflow architecture
--------------------------
+There are also some optional components that are not present in the basic installation

-This is the basic architecture of Airflow that you'll see in simple installations:
+* Optional *worker*, which executes the tasks given to it by the scheduler. In the basic installation
+  worker might be part of the scheduler not a separate component. It can be run as a long running process
+  in the :doc:`CeleryExecutor <executor/celery>`, or as a POD in the
+  :doc:`KubernetesExecutor <executor/kubernetes>`.
+
+* Optional *triggerer*, which executes deferred tasks in an async-io event loop. In basic installation
+  where deferred tasks are not used, triggerer might not be present. More about deferring tasks can be
+  found in :doc:`/authoring-and-scheduling/deferring`.
+
+* Optional *dag processor*, which parses DAG files and synchronizes them into the
+  *metadata database* in basic installation *dag processor* might be part of the scheduler not
+  a separate component.
+
+* A folder of *DAG files*, is read by *dag processor*, *workers* and *triggerer* when they are running.
+  If *dag processor* is present *scheduler** does not need to read the *DAG files* directly. More about
+  processing DAG files can be found in :doc:`/authoring-and-scheduling/dagfile-processing`
+
+* Optional folder of *plugins*. Plugins are a way to extend Airflow's functionality and by placing Python
+  files in the plugins folder you can extend Airflow functionality (similarly as via installed packages).
+  Plugins are read by the *scheduler*, *dag processor*, *triggerer* and *webserver*. More about
+  plugins can be found in :doc:`/authoring-and-scheduling/plugins`.
+
+Deploying Airflow components
+............................
+
+All the components are Python applications that can be deployed using various deployment mechanisms.
+
+They deployed Python applications can have extra *installed packages* installed in their Python environment.
+This is useful for example to install custom operators or sensors or extend Airflow functionality with custom
+plugins.
+
+While Airflow can be run in a single machine and with simple installation where only *scheduler* and
+*webserver* are deployed, Airflow is designed to be scalable and secure, and is able ot run in a distributed
+environment - where various components can run on different machines, with different security perimeters
+and can be scaled by running multiple instances of the components above. Also while single person can run and
+manage Airflow installation, Airflow Deployment in more complex setup can involve various roles of users
+as described in the :doc:`/security/security_model`.

Review Comment:
I split that into three separate paragraphs: one about scalability, one about security, and a following chapter about roles. For now I decided not to split out a separate "roles" document; I briefly mentioned and described the roles here, and linked to the "security model" for more details.

I also moved the following chapter

```
Airflow itself is agnostic to what you're running - it will happily orchestrate and run anything, either with high-level support from one of our providers, or directly as a command using the shell or Python :doc:`operators`.
```

to the top-level description, as it feels like it belongs there rather than in the "components" section.
