This is an automated email from the ASF dual-hosted git repository.

kaxilnaik pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/main by this push:
     new cc54a19bcdf Add basic dag-bundles docs; webserver->api server in 
security model (#48600)
cc54a19bcdf is described below

commit cc54a19bcdf35231fcc60a25ec93927d03fa8ddb
Author: Jed Cunningham <[email protected]>
AuthorDate: Tue Apr 1 03:36:16 2025 -0600

    Add basic dag-bundles docs; webserver->api server in security model (#48600)
---
 .../administration-and-deployment/dag-bundles.rst  | 129 +++++++++++++++++++++
 .../docs/administration-and-deployment/index.rst   |   1 +
 airflow-core/docs/core-concepts/dags.rst           |  16 +--
 airflow-core/docs/security/security_model.rst      |  42 +++----
 4 files changed, 160 insertions(+), 28 deletions(-)

diff --git a/airflow-core/docs/administration-and-deployment/dag-bundles.rst 
b/airflow-core/docs/administration-and-deployment/dag-bundles.rst
new file mode 100644
index 00000000000..27b21daa5ed
--- /dev/null
+++ b/airflow-core/docs/administration-and-deployment/dag-bundles.rst
@@ -0,0 +1,129 @@
+ .. Licensed to the Apache Software Foundation (ASF) under one
+    or more contributor license agreements.  See the NOTICE file
+    distributed with this work for additional information
+    regarding copyright ownership.  The ASF licenses this file
+    to you under the Apache License, Version 2.0 (the
+    "License"); you may not use this file except in compliance
+    with the License.  You may obtain a copy of the License at
+
+ ..   http://www.apache.org/licenses/LICENSE-2.0
+
+ .. Unless required by applicable law or agreed to in writing,
+    software distributed under the License is distributed on an
+    "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+    KIND, either express or implied.  See the License for the
+    specific language governing permissions and limitations
+    under the License.
+
+Dag Bundles
+===========
+
+Dag bundles are collections of dags and other files (think the Airflow 2 dags folder).
+Unlike Airflow 2, where dags had to be on local disk and getting them there was the sole
+responsibility of the deployment manager, Airflow 3 can also pull dags from external systems.
+And because dag bundles support versioning, Airflow can run a task using a specific version
+of the dag bundle, allowing a dag run to use the same code for the whole run, even if the
+dag is updated midway through the run.
+
+What's in a dag bundle? One or more dag files along with their associated files, such as
+other Python scripts, configuration files, or other resources. Because the bundle sits at
+this higher level, everything the dag needs to run can be versioned together.
+
+Dag bundles can source dags from various locations, such as local directories, Git
+repositories, or other external systems. Deployment administrators can also write their own
+dag bundle classes to support custom sources. You can also define more than one dag bundle
+in an Airflow deployment, allowing for better organization of your dags.
+
+Why are dag bundles important?
+------------------------------
+
+- **Version Control**: By supporting versioning, dag bundles allow dag runs to use the same code for the whole run, even if the dag is updated midway through the run.
+- **Scalability**: With dag bundles, Airflow can efficiently manage large 
numbers of DAGs by organizing them into logical units.
+- **Flexibility**: Dag bundles enable seamless integration with external 
systems, such as Git repositories, to source dags.
+
+Types of dag bundles
+--------------------
+
+Airflow supports multiple types of dag bundles, each catering to specific use cases:
+
+**airflow.dag_processing.bundles.local.LocalDagBundle**
+    These bundles reference a local directory containing dag files. They are ideal for development and testing environments, but do not support versioning of the bundle, meaning tasks always run using the latest code.
+
+**airflow.providers.git.bundles.git.GitDagBundle**
+    These bundles integrate with Git repositories, allowing Airflow to fetch 
dags directly from a repository.
+
+Configuring dag bundles
+-----------------------
+
+Dag bundles are configured in 
:ref:`config:dag_processor__dag_bundle_config_list`. You can add one or more 
dag bundles here.
+
+By default, Airflow adds a local dag bundle, which is the same as the old dags 
folder. This is done for backwards compatibility, and you can remove it if you 
do not want to use it. You can also keep it and add other dag bundles, such as 
a git dag bundle.
+
+For example, adding multiple dag bundles to your ``airflow.cfg`` file:
+
+.. code-block:: ini
+
+    [dag_processor]
+    dag_bundle_config_list = [
+            {
+                "name": "my_git_repo",
+                "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
+                "kwargs": {"tracking_ref": "main", "git_conn_id": "my_git_conn"}
+            },
+            {
+                "name": "dags-folder",
+                "classpath": "airflow.dag_processing.bundles.local.LocalDagBundle",
+                "kwargs": {}
+            }
+        ]
+
+.. note::
+
+    The whitespace, particularly on the last line, is important so a multi-line value works properly. More details
+    can be found in the `configparser docs <https://docs.python.org/3/library/configparser.html#supported-ini-file-structure>`_.
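Because Airflow parses this option as a JSON list, a missing comma or bracket only surfaces when the config is read. One hedged way to sanity-check the value before deploying is to run it through ``json.loads`` yourself (the bundle names below are just the ones from the example above):

```python
import json

# The multi-line value copied from [dag_processor] dag_bundle_config_list,
# assumed here to be plain JSON (which is how Airflow parses it).
raw = """
[
    {
        "name": "my_git_repo",
        "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
        "kwargs": {"tracking_ref": "main", "git_conn_id": "my_git_conn"}
    },
    {
        "name": "dags-folder",
        "classpath": "airflow.dag_processing.bundles.local.LocalDagBundle",
        "kwargs": {}
    }
]
"""

bundles = json.loads(raw)  # raises ValueError on malformed JSON, e.g. a missing comma
for bundle in bundles:
    # Every entry needs a name, a classpath, and a kwargs dict.
    assert {"name", "classpath", "kwargs"} <= bundle.keys(), bundle
print([b["name"] for b in bundles])  # prints: ['my_git_repo', 'dags-folder']
```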
+
+You can also override :ref:`config:dag_processor__refresh_interval` per dag bundle by passing ``refresh_interval`` in its kwargs.
+This controls how often the dag processor refreshes a bundle, i.e., looks for new or changed files in it.
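For example, to have the dag processor check a git bundle for changes every 5 minutes while other bundles keep the global default, something like the following sketch could work (it assumes ``refresh_interval`` is accepted in the bundle's kwargs and is measured in seconds):

```ini
[dag_processor]
dag_bundle_config_list = [
        {
            "name": "my_git_repo",
            "classpath": "airflow.providers.git.bundles.git.GitDagBundle",
            "kwargs": {"tracking_ref": "main", "git_conn_id": "my_git_conn", "refresh_interval": 300}
        }
    ]
```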
+
+Writing custom dag bundles
+--------------------------
+
+When implementing your own dag bundle by extending the ``BaseDagBundle`` class, there are several methods you must
+implement. Below is a guide to writing a custom dag bundle.
+
+Abstract Methods
+~~~~~~~~~~~~~~~~
+The following methods are abstract and must be implemented in your custom 
bundle class:
+
+**path**
+    This property should return a ``Path`` to the directory where the dag 
files for this bundle are stored.
+    Airflow uses this property to locate the DAG files for processing.
+
+**get_current_version**
+    This method should return the current version of the bundle as a string.
+    Airflow will pass this version to ``__init__`` later to get this version of the bundle again when it runs tasks.
+    If versioning is not supported, it should return ``None``.
+
+**refresh**
+    This method should handle refreshing the bundle's contents from its source 
(e.g., pulling the latest changes from a remote repository).
+    This is used by the dag processor periodically to ensure that the bundle 
is up-to-date.
+
+Optional Methods
+~~~~~~~~~~~~~~~~
+In addition to the abstract methods, you may choose to override the following 
methods to customize the behavior of your bundle:
+
+**__init__**
+    This method can be extended to initialize the bundle with extra 
parameters, such as ``tracking_ref`` for the ``GitDagBundle``.
+    It should also call the parent class's ``__init__`` method to ensure 
proper initialization.
+    Expensive operations, such as network calls, should be avoided in this method to prevent delays during the
+    bundle's instantiation; perform them in the ``initialize`` method instead.
+
+**initialize**
+    This method is called before the bundle is first used in the dag processor 
or worker. It allows you to perform expensive operations only when the bundle's 
content is accessed.
+
+**view_url**
+    This method should return a URL as a string to view the bundle on an 
external system (e.g., a Git repository's web interface).
+
+Other Considerations
+~~~~~~~~~~~~~~~~~~~~
+
+- **Versioning**: If your bundle supports versioning, ensure that 
``initialize``, ``get_current_version`` and ``refresh`` are implemented to 
handle version-specific logic.
+
+- **Concurrency**: Workers may create many bundle objects simultaneously, and Airflow does nothing to serialize
+  calls to them. Thus, the bundle class must handle locking if concurrent access is problematic for the underlying
+  technology. For example, if you are cloning a git repo, the bundle class is responsible for locking to ensure only
+  one bundle object is cloning at a time. There is a ``lock`` method in the base class that can be used for this
+  purpose, if necessary.
diff --git a/airflow-core/docs/administration-and-deployment/index.rst 
b/airflow-core/docs/administration-and-deployment/index.rst
index 0ec0e3e4d30..ec39a526a77 100644
--- a/airflow-core/docs/administration-and-deployment/index.rst
+++ b/airflow-core/docs/administration-and-deployment/index.rst
@@ -28,6 +28,7 @@ This section contains information about deploying dags into 
production and the a
     kubernetes
     lineage
     listeners
+    dag-bundles
     dag-serialization
     modules_management
     scheduler
diff --git a/airflow-core/docs/core-concepts/dags.rst 
b/airflow-core/docs/core-concepts/dags.rst
index 3fd9a88ed8a..0c7bcb8327d 100644
--- a/airflow-core/docs/core-concepts/dags.rst
+++ b/airflow-core/docs/core-concepts/dags.rst
@@ -147,7 +147,7 @@ Chain can also do *pairwise* dependencies for lists the 
same size (this is diffe
 Loading dags
 ------------
 
-Airflow loads dags from Python source files, which it looks for inside its 
configured ``DAG_FOLDER``. It will take each file, execute it, and then load 
any DAG objects from that file.
+Airflow loads dags from Python source files in dag bundles. It will take each 
file, execute it, and then load any DAG objects from that file.
 
 This means you can define multiple dags per Python file, or even spread one 
very complex DAG across multiple Python files using imports.
 
@@ -164,11 +164,11 @@ While both DAG constructors get called when the file is 
accessed, only ``dag_1``
 
 .. note::
 
-    When searching for dags inside the ``DAG_FOLDER``, Airflow only considers 
Python files that contain the strings ``airflow`` and ``dag`` 
(case-insensitively) as an optimization.
+    When searching for dags inside the dag bundle, Airflow only considers 
Python files that contain the strings ``airflow`` and ``dag`` 
(case-insensitively) as an optimization.
 
     To consider all Python files instead, disable the 
``DAG_DISCOVERY_SAFE_MODE`` configuration flag.
 
-You can also provide an ``.airflowignore`` file inside your ``DAG_FOLDER``, or 
any of its subfolders, which describes patterns of files for the loader to 
ignore. It covers the directory it's in plus all subfolders underneath it. See  
:ref:`.airflowignore <concepts:airflowignore>` below for details of the file 
syntax.
+You can also provide an ``.airflowignore`` file inside your dag bundle, or any 
of its subfolders, which describes patterns of files for the loader to ignore. 
It covers the directory it's in plus all subfolders underneath it. See  
:ref:`.airflowignore <concepts:airflowignore>` below for details of the file 
syntax.
 
 In the case where the ``.airflowignore`` does not meet your needs and you want 
a more flexible way to control if a python file needs to be parsed by airflow, 
you can plug your callable by setting ``might_contain_dag_callable`` in the 
config file.
 Note, this callable will replace the default Airflow heuristic, i.e. checking 
if the strings ``airflow`` and ``dag`` (case-insensitively) are present in the 
python file.
@@ -691,7 +691,7 @@ Packaging dags
 
 While simpler dags are usually only in a single Python file, it is not 
uncommon that more complex dags might be spread across multiple files and have 
dependencies that should be shipped with them ("vendored").
 
-You can either do this all inside of the ``DAG_FOLDER``, with a standard 
filesystem layout, or you can package the DAG and all of its Python files up as 
a single zip file. For instance, you could ship two dags along with a 
dependency they need as a zip file with the following contents::
+You can either do this all inside of the dag bundle, with a standard 
filesystem layout, or you can package the DAG and all of its Python files up as 
a single zip file. For instance, you could ship two dags along with a 
dependency they need as a zip file with the following contents::
 
     my_dag1.py
     my_dag2.py
@@ -711,7 +711,7 @@ In general, if you have a complex set of compiled 
dependencies and modules, you
 ``.airflowignore``
 ------------------
 
-An ``.airflowignore`` file specifies the directories or files in ``DAG_FOLDER``
+An ``.airflowignore`` file specifies the directories or files in the dag bundle
 or ``PLUGINS_FOLDER`` that Airflow should intentionally ignore. Airflow 
supports
 two syntax flavors for patterns in the file, as specified by the 
``DAG_IGNORE_FILE_SYNTAX``
 configuration parameter (*added in Airflow 2.3*): ``regexp`` and ``glob``.
@@ -740,7 +740,7 @@ match any of the patterns would be ignored (under the hood, 
``Pattern.search()``
 to match the pattern). Use the ``#`` character to indicate a comment; all 
characters
 on lines starting with ``#`` will be ignored.
 
-The ``.airflowignore`` file should be put in your ``DAG_FOLDER``. For example, 
you can prepare
+The ``.airflowignore`` file should be put in your dag bundle. For example, you 
can prepare
 a ``.airflowignore`` file with the ``glob`` syntax
 
 .. code-block::
@@ -749,12 +749,12 @@ a ``.airflowignore`` file with the ``glob`` syntax
     tenant_[0-9]*
 
 Then files like ``project_a_dag_1.py``, ``TESTING_project_a.py``, 
``tenant_1.py``,
-``project_a/dag_1.py``, and ``tenant_1/dag_1.py`` in your ``DAG_FOLDER`` would 
be ignored
+``project_a/dag_1.py``, and ``tenant_1/dag_1.py`` in your dag bundle would be 
ignored
 (If a directory's name matches any of the patterns, this directory and all its 
subfolders
 would not be scanned by Airflow at all. This improves efficiency of DAG 
finding).
 
 The scope of a ``.airflowignore`` file is the directory it is in plus all its 
subfolders.
-You can also prepare ``.airflowignore`` file for a subfolder in ``DAG_FOLDER`` 
and it
+You can also prepare ``.airflowignore`` file for a subfolder in your dag 
bundle and it
 would only be applicable for that subfolder.
 
 DAG Dependencies
diff --git a/airflow-core/docs/security/security_model.rst 
b/airflow-core/docs/security/security_model.rst
index 3e73924f71a..cf19d0a276e 100644
--- a/airflow-core/docs/security/security_model.rst
+++ b/airflow-core/docs/security/security_model.rst
@@ -62,9 +62,9 @@ DAG Authors
 ...........
 
 They can create, modify, and delete DAG files. The
-code in DAG files is executed on workers and in the DAG File Processor.
+code in DAG files is executed on workers and in the DAG Processor.
 Therefore, DAG authors can create and change code executed on workers
-and the DAG File Processor and potentially access the credentials that the DAG
+and the DAG Processor and potentially access the credentials that the DAG
 code uses to access external systems. DAG Authors have full access
 to the metadata database.
 
@@ -100,7 +100,7 @@ to abuse these privileges. They have access to sensitive 
credentials
 and can modify them. By default, they don't have access to
 system-level configuration. They should be trusted not to misuse
 sensitive information accessible through connection configuration.
-They also have the ability to create a Webserver Denial of Service
+They also have the ability to create an API Server Denial of Service
 situation and should be trusted not to misuse this capability.
 
 Only admin users have access to audit logs.
@@ -119,7 +119,7 @@ required to prevent misuse of these privileges. They have 
full access
 to sensitive credentials stored in connections and can modify them.
 Access to sensitive information through connection configuration
 should be trusted not to be abused. They also have the ability to configure 
connections wrongly
-that might create a Webserver Denial of Service situations and specify 
insecure connection options
+that might create an API Server Denial of Service situation and specify insecure connection options
 which might create situations where executing dags will lead to arbitrary 
Remote Code Execution
 for some providers - either community released or custom ones.
 
@@ -149,11 +149,11 @@ For more information on the capabilities of authenticated 
UI users, see :doc:`ap
 Capabilities of DAG Authors
 ---------------------------
 
-DAG authors are able to submit code - via Python files placed in the 
DAGS_FOLDER - that will be executed
+DAG authors are able to create or edit code - via Python files placed in a dag 
bundle - that will be executed
 in a number of circumstances. The code to execute is neither verified, checked 
nor sand-boxed by Airflow
 (that would be very difficult if not impossible to do), so effectively DAG 
authors can execute arbitrary
 code on the workers (part of Celery Workers for Celery Executor, local 
processes run by scheduler in case
-of Local Executor, Task Kubernetes POD in case of Kubernetes Executor), in the 
DAG File Processor
+of Local Executor, Task Kubernetes POD in case of Kubernetes Executor), in the 
DAG Processor
 and in the Triggerer.
 
 There are several consequences of this model chosen by Airflow, that 
deployment managers need to be aware of:
@@ -190,28 +190,30 @@ enforcement mechanisms that would allow to isolate tasks 
that are using deferrab
 each other and arbitrary code from various tasks can be executed in the same 
process/machine. Deployment
 Manager must trust that DAG authors will not abuse this capability.
 
-DAG files not needed for Scheduler and Webserver
-................................................
+DAG files not needed for Scheduler and API Server
+.................................................
 
 The Deployment Manager might isolate the code execution provided by DAG 
authors - particularly in
-Scheduler and Webserver by making sure that the Scheduler and Webserver don't 
even
+Scheduler and API Server by making sure that the Scheduler and API Server 
don't even
 have access to the DAG Files. Generally speaking - no DAG author provided code 
should ever be
-executed in the Scheduler or Webserver process.
+executed in the Scheduler or API Server process. This means the deployment 
manager can exclude credentials
+needed for dag bundles on the Scheduler and API Server - but the bundles must 
still be configured on those
+components.
 
-Allowing DAG authors to execute selected code in Scheduler and Webserver
-........................................................................
+Allowing DAG authors to execute selected code in Scheduler and API Server
+.........................................................................
 
 There are a number of functionalities that allow the DAG author to use 
pre-registered custom code to be
-executed in scheduler or webserver process - for example they can choose 
custom Timetables, UI plugins,
+executed in the Scheduler or API Server process - for example they can choose 
custom Timetables, UI plugins,
 Connection UI Fields, Operator extra links, macros, listeners - all of those 
functionalities allow the
-DAG author to choose the code that will be executed in the scheduler or 
webserver process. However this
-should not be arbitrary code that DAG author can add in DAG folder. All those 
functionalities are
+DAG author to choose the code that will be executed in the Scheduler or API Server process. However, this
+should not be arbitrary code that the DAG author can add to dag bundles. All those functionalities are
 only available via ``plugins`` and ``providers`` mechanisms where the code 
that is executed can only be
 provided by installed packages (or in case of plugins it can also be added to 
PLUGINS folder where DAG
 authors should not have write access to). PLUGINS_FOLDER is a legacy mechanism 
coming from Airflow 1.10
 - but we recommend using entrypoint mechanism that allows the Deployment 
Manager to - effectively -
 choose and register the code that will be executed in those contexts. DAG 
Author has no access to
-install or modify packages installed in Webserver and Scheduler, and this is 
the way to prevent
+install or modify packages installed in Scheduler and API Server, and this is 
the way to prevent
 the DAG Author from executing arbitrary code in those processes.
 
 Additionally, if you decide to utilize and configure the PLUGINS_FOLDER, it is 
essential for the Deployment
@@ -224,7 +226,7 @@ following chapter.
 Access to all dags
 ........................................................................
 
-All dag authors have access to all dags in the airflow deployment. This means 
that they can view, modify,
+All dag authors have access to all dags in the Airflow deployment. This means 
that they can view, modify,
 and update any dag without restrictions at any time.
 
 Responsibilities of Deployment Managers
@@ -233,7 +235,7 @@ Responsibilities of Deployment Managers
 As a Deployment Manager, you should be aware of the capabilities of DAG 
authors and make sure that
 you trust them not to abuse the capabilities they have. You should also make 
sure that you have
 properly configured the Airflow installation to prevent DAG authors from 
executing arbitrary code
-in the Scheduler and Webserver processes.
+in the Scheduler and API Server processes.
 
 Deploying and protecting Airflow installation
 .............................................
@@ -256,9 +258,9 @@ Limiting DAG Author capabilities
 The Deployment Manager might also use additional mechanisms to prevent DAG 
authors from executing
 arbitrary code - for example they might introduce tooling around DAG 
submission that would allow
 to review the code before it is deployed, statically-check it and add other 
ways to prevent malicious
-code to be submitted. The way how submitting code to DAG folder is done and 
protected is completely
+code to be submitted. The way submitting code to a DAG bundle is done and 
protected is completely
 up to the Deployment Manager - Airflow does not provide any tooling or 
mechanisms around it and it
-expects that the Deployment Manager will provide the tooling to protect access 
to the DAG folder and
+expects that the Deployment Manager will provide the tooling to protect access 
to DAG bundles and
 make sure that only trusted code is submitted there.
 
 Airflow does not implement any of those feature natively, and delegates it to 
the deployment managers
