o-nikolas commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951796105
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some
limits and overhead), and
+we will gradually go through those strategies that requires some changes in
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created
before task is run, and
+ removed after it is finished, so there is nothing special (except having
virtualenv package in your
+ airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers -
thus Memory resources are
+ reused (though see below about the CPU overhead involved in creating the
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the
venvs for you.
+ As DAG Author, you only have to have virtualenv dependency installed and you
can specify and modify the
+ environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or
Docker, or Kubernetes,
+ the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
+ is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python
objects that are not serializable
+ using standard ``pickle`` library. You can mitigate some of those
limitations by using ``dill`` library
+ but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be
locally imported in the callable you
+ use and the top-level Python code of your DAG should not import/use those
libraries.
+* The virtual environments are run in the same operating system, so they
cannot have conflicting system-level
+ dependencies (``apt`` or ``yum`` installable packages). Only Python
dependencies can be independently
+ installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running
each task - Airflow has
+ to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for
example when your repo is not available
+ or when there is a networking issue with reaching the repository
+* It's easy to fall into a "too" dynamic environment - since the dependencies
you install might get upgraded
+ and their transitive dependencies might get independent upgrades you might
end up with the situation where
+ your task will stop working because someone released a new version of a
dependency or you might fall
+ a victim of "supply chain" attack where new version of a dependency might
become malicious
+* The tasks are only isolated from each other via running in different
environments. This makes it possible
+ that running tasks will still interfere with each other - for example
subsequent tasks executed on the
+ same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability
problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the
virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the
workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead
and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger
installation those are usually different
Review Comment:
```suggestion
who manage Airflow installation need to be involved (and in bigger
installations those are usually different
```
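For anyone reading along, a minimal sketch of the `@task.virtualenv` pattern this hunk describes (illustrative only, not part of the diff; the DAG id and the `pandas` pin are made up):

```python
# Illustrative sketch, not part of this PR. The DAG id and requirement pin are
# invented for the example.
import pendulum

from airflow.decorators import dag, task


@dag(start_date=pendulum.datetime(2022, 1, 1), schedule_interval=None, catchup=False)
def virtualenv_demo():
    @task.virtualenv(requirements=["pandas==1.4.3"], system_site_packages=False)
    def summarize():
        # The extra dependency is imported locally, inside the callable, so the
        # main Airflow environment never needs it.
        import pandas as pd

        return int(pd.Series([1, 2, 3]).sum())

    summarize()


virtualenv_demo()
```

The venv is created from scratch when the task runs and removed afterwards, which is the CPU and elapsed-time overhead the limitations list in the hunk above calls out.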
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
Review Comment:
```suggestion
which means that you have a "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
Review Comment:
```suggestion
independently and their constraints do not limit you so the chance of a
conflicting dependency is lower (you still have
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some
limits and overhead), and
+we will gradually go through those strategies that requires some changes in
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created
before task is run, and
+ removed after it is finished, so there is nothing special (except having
virtualenv package in your
+ airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers -
thus Memory resources are
+ reused (though see below about the CPU overhead involved in creating the
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the
venvs for you.
+ As DAG Author, you only have to have virtualenv dependency installed and you
can specify and modify the
+ environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or
Docker, or Kubernetes,
+ the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
+ is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python
objects that are not serializable
+ using standard ``pickle`` library. You can mitigate some of those
limitations by using ``dill`` library
+ but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be
locally imported in the callable you
+ use and the top-level Python code of your DAG should not import/use those
libraries.
+* The virtual environments are run in the same operating system, so they
cannot have conflicting system-level
+ dependencies (``apt`` or ``yum`` installable packages). Only Python
dependencies can be independently
+ installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running
each task - Airflow has
+ to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for
example when your repo is not available
+ or when there is a networking issue with reaching the repository
+* It's easy to fall into a "too" dynamic environment - since the dependencies
you install might get upgraded
+ and their transitive dependencies might get independent upgrades you might
end up with the situation where
+ your task will stop working because someone released a new version of a
dependency or you might fall
+ a victim of "supply chain" attack where new version of a dependency might
become malicious
+* The tasks are only isolated from each other via running in different
environments. This makes it possible
+ that running tasks will still interfere with each other - for example
subsequent tasks executed on the
+ same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability
problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the
virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the
workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead
and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger
installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you
start running a task.
+* You can run tasks with different sets of dependencies on the same workers -
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories.
Less chance for transient
+ errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no
unexpected, new code will
+ be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker
containers or Kubernetes to
+ make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
Review Comment:
```suggestion
* No need to learn more about containers, Kubernetes as a DAG Author. Only
knowledge of Python, requirements
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some
limits and overhead), and
+we will gradually go through those strategies that requires some changes in
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created
before task is run, and
+ removed after it is finished, so there is nothing special (except having
virtualenv package in your
+ airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers -
thus Memory resources are
+ reused (though see below about the CPU overhead involved in creating the
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the
venvs for you.
+ As DAG Author, you only have to have virtualenv dependency installed and you
can specify and modify the
+ environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or
Docker, or Kubernetes,
+ the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
+ is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python
objects that are not serializable
+ using standard ``pickle`` library. You can mitigate some of those
limitations by using ``dill`` library
+ but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be
locally imported in the callable you
+ use and the top-level Python code of your DAG should not import/use those
libraries.
+* The virtual environments are run in the same operating system, so they
cannot have conflicting system-level
+ dependencies (``apt`` or ``yum`` installable packages). Only Python
dependencies can be independently
+ installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running
each task - Airflow has
+ to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for
example when your repo is not available
+ or when there is a networking issue with reaching the repository
+* It's easy to fall into a "too" dynamic environment - since the dependencies
you install might get upgraded
+ and their transitive dependencies might get independent upgrades you might
end up with the situation where
+ your task will stop working because someone released a new version of a
dependency or you might fall
+ a victim of "supply chain" attack where new version of a dependency might
become malicious
+* The tasks are only isolated from each other via running in different
environments. This makes it possible
+ that running tasks will still interfere with each other - for example
subsequent tasks executed on the
+ same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability
problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the
virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the
workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead
and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger
installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you
start running a task.
+* You can run tasks with different sets of dependencies on the same workers -
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories.
Less chance for transient
+ errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no
unexpected, new code will
+ be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker
containers or Kubernetes to
+ make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
+ is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront.
This usually means that you
+ cannot change it on the flight, adding new or changing requirements require
at least airflow re-deployment
+ and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python
objects that are not serializable
+ using standard ``pickle`` library. You can mitigate some of those
limitations by using ``dill`` library
+ but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be
locally imported in the callable you
+ use and the top-level Python code of your DAG should not import/use those
libraries.
+* The virtual environments are run in the same operating system, so they
cannot have conflicting system-level
+ dependencies (``apt`` or ``yum`` installable packages). Only Python
dependencies can be independently
+ installed in those environments
+* The tasks are only isolated from each other via running in different
environments. This makes it possible
+ that running tasks will still interfere with each other - for example
subsequent tasks executed on the
+ same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and
``PreexistingPythonVirtualenvOperator``
+as counterparts - as DAG author you'd normally iterate with dependencies and
develop your DAG using
Review Comment:
```suggestion
as counterparts - as a DAG author you'd normally iterate with dependencies
and develop your DAG using
```
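To make the counterparts point concrete, a rough sketch (assumed usage, not from the diff; the requirement pin and venv path are hypothetical) of iterating with `@task.virtualenv` and then switching to a pre-built environment once the requirements settle:

```python
# Sketch only: the requirement pin and the venv path are hypothetical.
from airflow.decorators import task


# While iterating on dependencies: the venv is built for each task run.
@task.virtualenv(requirements=["scikit-learn==1.1.2"])
def score_iterating():
    import sklearn  # imported locally, only present in the task venv

    return sklearn.__version__


# Once the set of requirements stabilizes: point at a venv the deployment
# prepared upfront on every worker.
@task.preexisting_virtualenv(python="/opt/airflow/venvs/ml/bin/python")
def score_pinned():
    import sklearn

    return sklearn.__version__
```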
##########
airflow/decorators/__init__.pyi:
##########
@@ -123,6 +125,37 @@ class TaskDecoratorCollection:
"""
@overload
def virtualenv(self, python_callable: Callable[FParams, FReturn]) ->
Task[FParams, FReturn]: ...
+ def preexisting_virtualenv(
+ self,
+ *,
+ python: str,
+ multiple_outputs: Optional[bool] = None,
+ # 'python_callable', 'op_args' and 'op_kwargs' since they are filled by
+ # _PythonVirtualenvDecoratedOperator.
+ use_dill: bool = False,
+ templates_dict: Optional[Mapping[str, Any]] = None,
+ show_return_value_in_logs: bool = True,
+ **kwargs,
+ ) -> TaskDecorator:
+ """Create a decorator to convert the decorated callable to a virtual
environment task.
+
+ :param python: Full time path string (file-system specific) that
points to a Python binary inside
Review Comment:
```suggestion
:param python: Full path string (file-system specific) that points
to a Python binary inside
```
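For reference, a usage sketch matching the stub above (the interpreter path is hypothetical and the API is still under review in this PR):

```python
# Sketch based on the decorator stub under review; the venv path is made up.
from airflow.decorators import task


# `python` is the full, file-system specific path to the interpreter of a
# virtualenv that already exists on every worker.
@task.preexisting_virtualenv(python="/opt/venvs/reporting/bin/python", use_dill=False)
def build_report(run_date: str):
    # numpy must already be installed in that venv; importing it locally keeps
    # DAG parsing independent of it.
    import numpy as np

    return {"run_date": run_date, "mean": float(np.mean([1.0, 2.0, 3.0]))}
```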
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some
limits and overhead), and
+we will gradually go through those strategies that requires some changes in
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created
before task is run, and
+ removed after it is finished, so there is nothing special (except having
virtualenv package in your
+ airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers -
thus Memory resources are
+ reused (though see below about the CPU overhead involved in creating the
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the
venvs for you.
+ As DAG Author, you only have to have virtualenv dependency installed and you
can specify and modify the
+ environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or
Docker, or Kubernetes,
+ the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
Review Comment:
```suggestion
* No need to learn more about containers, Kubernetes as a DAG Author. Only
knowledge of Python requirements
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some
limits and overhead), and
+we will gradually go through those strategies that requires some changes in
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created
before task is run, and
+ removed after it is finished, so there is nothing special (except having
virtualenv package in your
+ airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers -
thus Memory resources are
+ reused (though see below about the CPU overhead involved in creating the
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the
venvs for you.
+ As DAG Author, you only have to have virtualenv dependency installed and you
can specify and modify the
+ environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or
Docker, or Kubernetes,
+ the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
+ is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
Review Comment:
```suggestion
There are certain limitations and overhead introduced by this operator:
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some
limits and overhead), and
+we will gradually go through those strategies that requires some changes in
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created
before task is run, and
+ removed after it is finished, so there is nothing special (except having
virtualenv package in your
+ airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers -
thus Memory resources are
+ reused (though see below about the CPU overhead involved in creating the
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the
venvs for you.
+ As DAG Author, you only have to have virtualenv dependency installed and you
can specify and modify the
+ environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or
Docker, or Kubernetes,
+ the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
+ is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python
objects that are not serializable
+ using standard ``pickle`` library. You can mitigate some of those
limitations by using ``dill`` library
+ but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be
locally imported in the callable you
Review Comment:
```suggestion
* All dependencies that are not available in the Airflow environment must be
locally imported in the callable you
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some
limits and overhead), and
+we will gradually go through those strategies that requires some changes in
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created
before task is run, and
+ removed after it is finished, so there is nothing special (except having
virtualenv package in your
+ airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers -
thus Memory resources are
+ reused (though see below about the CPU overhead involved in creating the
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the
venvs for you.
+ As DAG Author, you only have to have virtualenv dependency installed and you
can specify and modify the
+ environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or
Docker, or Kubernetes,
+ the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
+ is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python
objects that are not serializable
+ using standard ``pickle`` library. You can mitigate some of those
limitations by using ``dill`` library
+ but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be
locally imported in the callable you
+ use and the top-level Python code of your DAG should not import/use those
libraries.
+* The virtual environments are run in the same operating system, so they
cannot have conflicting system-level
+ dependencies (``apt`` or ``yum`` installable packages). Only Python
dependencies can be independently
+ installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running
each task - Airflow has
+ to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for
example when your repo is not available
+ or when there is a networking issue with reaching the repository
Review Comment:
```suggestion
or when there is a networking issue with reaching the repository)
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some
limits and overhead), and
+we will gradually go through those strategies that requires some changes in
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
Review Comment:
```suggestion
The operator takes care of:
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python
objects that are not serializable
+ using standard ``pickle`` library. You can mitigate some of those
limitations by using ``dill`` library
+ but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be
locally imported in the callable you
+ use and the top-level Python code of your DAG should not import/use those
libraries.
Review Comment:
This is a very important point!
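   Might even be worth a tiny sketch in the docs to show what "locally imported" looks like in
   practice - something along these lines, using `@task.virtualenv`, with `pandas` purely as an
   illustrative requirement:
   ```python
   import pendulum

   from airflow.decorators import dag, task


   @dag(start_date=pendulum.datetime(2022, 1, 1, tz="UTC"), schedule=None, catchup=False)
   def virtualenv_local_import_sketch():
       # Note: no top-level "import pandas" in this file - the scheduler parses the DAG
       # with the plain Airflow environment, which may not have pandas installed.

       @task.virtualenv(requirements=["pandas"], system_site_packages=False)
       def summarize():
           # The extra dependency is imported *inside* the callable, so it is only
           # resolved in the dynamically created virtualenv.
           import pandas as pd

           return int(pd.Series([1, 2, 3]).sum())

       summarize()


   virtualenv_local_import_sketch()
   ```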
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability
problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the
virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the
workers in case your
Review Comment:
What do you specifically mean by "It requires however that the virtualenv
you use is immutable by the task"?
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+A bit more complex but with significantly less overhead, security, stability
problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the
virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the
workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead
and problems of re-creating the
Review Comment:
```suggestion
Airflow runs in a distributed environment). This way you avoid the overhead
and problems of re-creating the
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -20,10 +20,11 @@
Best Practices
==============
-Creating a new DAG is a two-step process:
+Creating a new DAG is a three-step process:
- writing Python code to create a DAG object,
-- testing if the code meets our expectations
+- testing if the code meets our expectations,
+- running the DAG in production
This tutorial will introduce you to the best practices for these two steps.
Review Comment:
```suggestion
This tutorial will introduce you to the best practices for these three steps.
```
   Are there other spots in the text below that also need to be adapted now
   that there are three points?
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is
very large, consider pruning some of the old data with the :ref:`db
clean<cli-db-clean>` command prior to performing the upgrade. *Use with
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services,
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow
dependencies. Airflow uses constraints mechanism
+which means that you have "fixed" set of dependencies that the community
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you
can upgrade the providers
+independently and there constraints do not limit you so chance of conflicting
dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some
limits and overhead), and
+we will gradually go through those strategies that requires some changes in
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
+to be installed for that task to execute.
+
+The operator takes care about:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created
before task is run, and
+ removed after it is finished, so there is nothing special (except having
virtualenv package in your
+ airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers -
thus Memory resources are
+ reused (though see below about the CPU overhead involved in creating the
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the
venvs for you.
+ As DAG Author, you only have to have virtualenv dependency installed and you
can specify and modify the
+ environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or
Docker, or Kubernetes,
+ the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
+ is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by the operator:
+
+* Your python callable has to be serializable. There are a number of python
objects that are not serializable
+ using standard ``pickle`` library. You can mitigate some of those
limitations by using ``dill`` library
+ but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be
locally imported in the callable you
+ use and the top-level Python code of your DAG should not import/use those
libraries.
+* The virtual environments are run in the same operating system, so they
cannot have conflicting system-level
+ dependencies (``apt`` or ``yum`` installable packages). Only Python
dependencies can be independently
+ installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running
each task - Airflow has
+ to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for
example when your repo is not available
+ or when there is a networking issue with reaching the repository
+* It's easy to fall into a "too" dynamic environment - since the dependencies
you install might get upgraded
+ and their transitive dependencies might get independent upgrades you might
end up with the situation where
+ your task will stop working because someone released a new version of a
dependency or you might fall
+ a victim of "supply chain" attack where new version of a dependency might
become malicious
+* The tasks are only isolated from each other via running in different
environments. This makes it possible
+ that running tasks will still interfere with each other - for example
subsequent tasks executed on the
+ same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability
problems is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the
virtualenv you use is immutable
+by the task and prepared upfront in your environment (and available in all the
workers in case your
+Airflow runs in a distributed environments). This way you avoid the overhead
and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger
installation those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you
start running a task.
+* You can run tasks with different sets of dependencies on the same workers -
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories.
Less chance for transient
+ errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no
unexpected, new code will
+ be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker
containers or Kubernetes to
+ make a good use of the operator.
+* No need to learn more about containers, Kubernetes as DAG Author. Only
knowledge of Python, requirements
+ is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront.
This usually means that you
+ cannot change it on the flight, adding new or changing requirements require
at least airflow re-deployment
+ and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python
objects that are not serializable
+ using standard ``pickle`` library. You can mitigate some of those
limitations by using ``dill`` library
+ but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be
locally imported in the callable you
+ use and the top-level Python code of your DAG should not import/use those
libraries.
+* The virtual environments are run in the same operating system, so they
cannot have conflicting system-level
+ dependencies (``apt`` or ``yum`` installable packages). Only Python
dependencies can be independently
+ installed in those environments
+* The tasks are only isolated from each other via running in different
environments. This makes it possible
+ that running tasks will still interfere with each other - for example
subsequent tasks executed on the
+ same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and
``PreexistingPythonVirtualenvOperator``
Review Comment:
```suggestion
Actually, you can think about the ``PythonVirtualenvOperator`` and
``PythonPreexistingVirtualenvOperator``
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+There are a number of strategies that can be employed to mitigate the problem.
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PreexistingPythonVirtualenvOperator``.
Review Comment:
```suggestion
``PythonPreexistingVirtualenvOperator``.
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -20,10 +20,11 @@
Best Practices
==============
-Creating a new DAG is a two-step process:
+Creating a new DAG is a three-step process:
- writing Python code to create a DAG object,
-- testing if the code meets our expectations
+- testing if the code meets our expectations,
+- running the DAG in production
Review Comment:
   Something about "running in prod" feels strange to me (does your DAG ever
   have to be in prod to be completely created?)
Maybe generalize this to `- configuring environment dependencies to run your
DAG`
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
+you might get to the point where dependencies required by the custom code of
yours are conflicting with those
Review Comment:
```suggestion
you might get to the point where the dependencies required by the custom
code of yours are conflicting with those
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+However, when you are approaching Airflow in a more "modern way", where you
use TaskFlow Api and most of
+your operators is written using custom python code, or when you want to write
your own Custom Operator,
Review Comment:
```suggestion
your operators are written using custom python code, or when you want to
write your own Custom Operator,
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+* In bigger installations, DAG Authors do not need to ask anyone to create the
venvs for you.
+ As DAG Author, you only have to have virtualenv dependency installed and you
can specify and modify the
Review Comment:
```suggestion
As a DAG Author, you only have to have virtualenv dependency installed and
you can specify and modify the
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront.
This usually means that you
+ cannot change it on the flight, adding new or changing requirements require
at least airflow re-deployment
Review Comment:
```suggestion
cannot change it on the fly, adding new or changing requirements require
at least an Airflow re-deployment
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+Using PreexistingPythonVirtualenvOperator
Review Comment:
```suggestion
Using PythonPreexistingVirtualenvOperator
```
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have it's own independent Python virtualenv and can specify fine-grained set
of requirements that need
Review Comment:
```suggestion
have its own independent Python virtualenv and can specify fine-grained set
of requirements that need
```
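As a side note (not part of the quoted diff), the classic non-decorator form of the same idea might look roughly like the sketch below; the ``requirements`` pin and the DAG/task ids are illustrative only:

```python
import pendulum

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator


def callable_in_venv():
    # Imported locally so that parsing the DAG file does not require the dependency.
    import colorama

    return colorama.__version__


with DAG(
    dag_id="classic_venv_example",
    start_date=pendulum.datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    PythonVirtualenvOperator(
        task_id="run_in_fresh_venv",
        python_callable=callable_in_venv,
        requirements=["colorama==0.4.5"],
        system_site_packages=False,
    )
```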
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+Using PreexistingPythonVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex, but with significantly less overhead and fewer security and stability problems, is to use the
+:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator``, or
even better - decorating your callable with
Review Comment:
```suggestion
:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or
even better - decorating your callable with
```
I'm starting to question whether I have it wrong now :smile:
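For what it's worth, a purely illustrative sketch of how this operator might be used - the class name is exactly what is being debated in this thread, and the ``python`` keyword (pointing at the interpreter of a venv prepared ahead of time, e.g. baked into the worker image) is an assumption rather than a confirmed API:

```python
import pendulum

from airflow import DAG
from airflow.operators.python import PreexistingPythonVirtualenvOperator  # name still under discussion in this PR


def callable_in_existing_venv():
    # The dependency is expected to be pre-installed in the target venv, not installed at task runtime.
    import numpy

    return numpy.__version__


with DAG(
    dag_id="preexisting_venv_example",
    start_date=pendulum.datetime(2022, 1, 1),
    schedule_interval=None,
    catchup=False,
):
    PreexistingPythonVirtualenvOperator(
        task_id="run_in_prebuilt_venv",
        python_callable=callable_in_existing_venv,
        # Assumed keyword: path to the interpreter of a venv created ahead of time.
        python="/opt/venvs/numpy_venv/bin/python",
    )
```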
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The
PythonVirtualenvOperator allows you to dynamically
+create virtualenv that your Python callable function will execute in. In the
modern
Review Comment:
```suggestion
create a virtualenv that your Python callable function will execute in. In
the modern
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]