[GitHub] [airflow] ashb commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

GitBox Fri, 26 Aug 2022 03:36:20 -0700


ashb commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r955896696



##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed 
pass them in ``requ
 All supported options are listed in the `requirements file format 
<https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks 
with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` 
to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the 
environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the 
virtualenv (the same
+version that is installed in main Airflow environment).
+
+.. exampleinclude:: /../../airflow/example_dags/example_python_operator.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_preexisting_virtualenv]
+    :end-before: [END howto_operator_preexisting_virtualenv]
+
+Passing in arguments
+^^^^^^^^^^^^^^^^^^^^
+
+You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use 
it in the PythonOperator.

Review Comment:
   ```suggestion
   You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use 
it with the PythonOperator.
   ```



##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed 
pass them in ``requ
 All supported options are listed in the `requirements file format 
<https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks 
with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` 
to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the 
environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the 
virtualenv (the same

Review Comment:
   ```suggestion
   Python is run and in case ``dill`` is used, it has to be installed in the 
virtualenv (the same
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c

Review Comment:
   ```suggestion
     same worker might be affected by previous tasks creating/modifying files 
etc.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added
+upfront in your environment (and available in all the workers in case your 
Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with 
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger 
installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use 
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery 
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple 
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be 
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you 
start running a task.
+* You can run tasks with different sets of dependencies on the same workers - 
thus all resources are reused.

Review Comment:
   What resources? I don't understand that final bit.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added
+upfront in your environment (and available in all the workers in case your 
Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with 
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger 
installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use 
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery 
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple 
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be 
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you 
start running a task.
+* You can run tasks with different sets of dependencies on the same workers - 
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. 
Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no 
unexpected, new code will
+  be added dynamically. This is good for both, security and stability.

Review Comment:
   ```suggestion
     be added at task run time. This is good for both security and stability.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added
+upfront in your environment (and available in all the workers in case your 
Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with 
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger 
installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use 
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery 
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple 
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be 
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you 
start running a task.
+* You can run tasks with different sets of dependencies on the same workers - 
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. 
Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no 
unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker 
containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. 
This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at 
least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and 
``PythonPreexistingVirtualenvOperator``
+as counterparts - as a DAG author you'd normally iterate with dependencies and 
develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with 
``@task.virtualenv`` decorators) while
+after the iteration and changes you would likely want to change it for 
production to switch to
+the ``PythonPreexistingVirtualenvOperator`` after your DevOps/System Admin 
teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the 
decorator back
+at any time and continue developing it "dynamically" with 
``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use the Docker Operator or the Kubernetes Pod Operator. 
Those require that Airflow runs in a
+Docker container environment or Kubernetes environment (or at the very least 
have access to create and
+run tasks with those).
+
+Similarly as in case of Python operators, the taskflow decorators are handy 
for you if you would like to
+use those operators to execute your callable Python code.
+
+However, it is far more involved - you need to understand how 
Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are 
not even limited to running
+Python code. You can write your tasks in any Programming language you want. 
Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) 
so if your task require
+a very different environment, this is the way to go. Those are 
``@task.docker`` and ``@task.kubernetes``
+decorators.

Review Comment:
   ```suggestion
   Similarly as in case of Python operators, the taskflow decorators 
``@task.docker`` and ``@task.kubernets``` are handy for you if you would like to
   use those operators to execute your callable Python code.
   
   However, it is far more involved - you need to understand how 
Docker/Kubernetes Pods work if you want to use
   this approach, but the tasks are fully isolated from each other and you are 
not even limited to running
   Python code. You can write your tasks in any Programming language you want. 
Also your dependencies are
   fully independent from Airflow ones (including the system level 
dependencies) so if your task require
   a very different environment, this is the way to go.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added
+upfront in your environment (and available in all the workers in case your 
Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with 
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger 
installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use 
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery 
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple 
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be 
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you 
start running a task.
+* You can run tasks with different sets of dependencies on the same workers - 
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. 
Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no 
unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker 
containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. 
This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at 
least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and 
``PythonPreexistingVirtualenvOperator``
+as counterparts - as a DAG author you'd normally iterate with dependencies and 
develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with 
``@task.virtualenv`` decorators) while
+after the iteration and changes you would likely want to change it for 
production to switch to
+the ``PythonPreexistingVirtualenvOperator`` after your DevOps/System Admin 
teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the 
decorator back
+at any time and continue developing it "dynamically" with 
``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use the Docker Operator or the Kubernetes Pod Operator. 
Those require that Airflow runs in a
+Docker container environment or Kubernetes environment (or at the very least 
have access to create and
+run tasks with those).
+
+Similarly as in case of Python operators, the taskflow decorators are handy 
for you if you would like to
+use those operators to execute your callable Python code.
+
+However, it is far more involved - you need to understand how 
Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are 
not even limited to running
+Python code. You can write your tasks in any Programming language you want. 
Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) 
so if your task require
+a very different environment, this is the way to go. Those are 
``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:

Review Comment:
   ```suggestion
   The benefits of these operators are:
   ```



##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed 
pass them in ``requ
 All supported options are listed in the `requirements file format 
<https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks 
with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` 
to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the 
environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the 
virtualenv (the same
+version that is installed in main Airflow environment).
+
+.. exampleinclude:: /../../airflow/example_dags/example_python_operator.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_preexisting_virtualenv]
+    :end-before: [END howto_operator_preexisting_virtualenv]
+
+Passing in arguments
+^^^^^^^^^^^^^^^^^^^^
+
+You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use 
it in the PythonOperator.
+Unfortunately we currently do not support to serialize ``var`` and ``ti`` / 
``task_instance`` due to incompatibilities
+with the underlying library. For Airflow context variables make sure that 
Airflow is also installed as part

Review Comment:
   What does this mean? I can't quite understand it.



##########
docs/apache-airflow/howto/operator/python.rst:
##########
@@ -89,6 +89,36 @@ If additional parameters for package installation are needed 
pass them in ``requ
 All supported options are listed in the `requirements file format 
<https://pip.pypa.io/en/stable/reference/requirements-file-format/#supported-options>`_.
 
 
+.. _howto/operator:PythonPreexistingVirtualenvOperator:
+
+PythonPreexistingVirtualenvOperator
+===================================
+
+The PythonPreexistingVirtualenvOperator can help you to run some of your tasks 
with a different set of Python
+libraries than other tasks (and than the main Airflow environment).
+
+Use the :class:`~airflow.operators.python.PythonPreexistingVirtualenvOperator` 
to execute Python callables inside a
+pre-defined virtual environment. The virtualenv should be preinstalled in the 
environment where
+Python is run and in case ``dill`` is used, it has to be preinstalled in the 
virtualenv (the same
+version that is installed in main Airflow environment).
+
+.. exampleinclude:: /../../airflow/example_dags/example_python_operator.py
+    :language: python
+    :dedent: 4
+    :start-after: [START howto_operator_preexisting_virtualenv]
+    :end-before: [END howto_operator_preexisting_virtualenv]
+
+Passing in arguments
+^^^^^^^^^^^^^^^^^^^^
+
+You can use the ``op_args`` and ``op_kwargs`` arguments the same way you use 
it in the PythonOperator.
+Unfortunately we currently do not support to serialize ``var`` and ``ti`` / 
``task_instance`` due to incompatibilities
+with the underlying library. For Airflow context variables make sure that 
Airflow is also installed as part
+of the virtualenv environment in the same version as the Airflow version the 
task is run on.
+Otherwise you won't have access to the most context variables of Airflow in 
``op_kwargs``.
+If you want the context related to datetime objects like 
``data_interval_start`` you can add ``pendulum`` and
+``lazy_object_proxy`` to your virtualenv.

Review Comment:
   Hmmm, what is lazy_object_proxy needed for? This one feels like it shouldn't 
be required for "most" users.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,

Review Comment:
   I don't like this framing -- it paints pre-packaged operators as legacy and 
to-be-avoided, which they are not. Please reword this.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.

Review Comment:
   ```suggestion
   and the dependencies conflict between those tasks.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).

Review Comment:
   I don't think the point here about memory being reused is true -- since it's 
a new process each new venv has it's own copy of python and modules loaded -- 
nothing is shared.



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added

Review Comment:
   ```suggestion
   you cannot add new dependencies from the task to such a pre-existing 
virtualenv. All dependencies you need should be added
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the

Review Comment:
   ```suggestion
   A bit more complex but with significantly less overhead, security, and 
stability problems is to use the
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added
+upfront in your environment (and available in all the workers in case your 
Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with 
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger 
installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use 
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery 
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple 
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be 
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you 
start running a task.
+* You can run tasks with different sets of dependencies on the same workers - 
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. 
Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no 
unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker 
containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python, requirements

Review Comment:
   ```suggestion
   * No need to learn more about containers or Kubernetes as a DAG Author. Only 
knowledge of Python and requirements
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added
+upfront in your environment (and available in all the workers in case your 
Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with 
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger 
installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use 
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery 
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple 
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be 
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you 
start running a task.
+* You can run tasks with different sets of dependencies on the same workers - 
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. 
Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no 
unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker 
containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. 
This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at 
least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c

Review Comment:
   ```suggestion
     same worker might be affected by previous tasks creating/modifying files 
etc.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added
+upfront in your environment (and available in all the workers in case your 
Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with 
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger 
installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use 
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery 
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple 
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be 
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you 
start running a task.
+* You can run tasks with different sets of dependencies on the same workers - 
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. 
Less chance for transient
+  errors resulting from networking.

Review Comment:
   ```suggestion
   * There is no need to have access by workers to PyPI or private 
repositories; less chance for transient
     errors resulting from networking glitches.
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added
+upfront in your environment (and available in all the workers in case your 
Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with 
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger 
installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use 
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery 
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple 
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be 
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you 
start running a task.
+* You can run tasks with different sets of dependencies on the same workers - 
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. 
Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no 
unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker 
containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. 
This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at 
least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and 
``PythonPreexistingVirtualenvOperator``

Review Comment:
   ```suggestion
   You can think about the ``PythonVirtualenvOperator`` and 
``PythonPreexistingVirtualenvOperator``
   ```



##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,221 @@ Prune data before upgrading
 ---------------------------
 
 Some database migrations can be time-consuming.  If your metadata database is 
very large, consider pruning some of the old data with the :ref:`db 
clean<cli-db-clean>` command prior to performing the upgrade.  *Use with 
caution.*
+
+
+Handling Python dependencies
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Airflow has many Python dependencies and sometimes the Airflow dependencies 
are conflicting with dependencies that your
+task code expects. Since - by default - Airflow environment is just a single 
set of Python dependencies and single
+Python environment, often there might also be cases that some of your tasks 
require different dependencies than other tasks
+and the dependencies basically conflict between those tasks.
+
+If you are using pre-defined Airflow Operators to talk to external services, 
there is not much choice, but usually those
+operators will have dependencies that are not conflicting with basic Airflow 
dependencies. Airflow uses constraints mechanism
+which means that you have a "fixed" set of dependencies that the community 
guarantees that Airflow can be installed with
+(including all community providers) without triggering conflicts. However you 
can upgrade the providers
+independently and their constraints do not limit you so the chance of a 
conflicting dependency is lower (you still have
+to test those dependencies). Therefore when you are using pre-defined 
operators, chance is that you will have
+little, to no problems with conflicting dependencies.
+
+However, when you are approaching Airflow in a more "modern way", where you 
use TaskFlow Api and most of
+your operators are written using custom python code, or when you want to write 
your own Custom Operator,
+you might get to the point where the dependencies required by the custom code 
of yours are conflicting with those
+of Airflow, or even that dependencies of several of your Custom Operators 
introduce conflicts between themselves.
+
+There are a number of strategies that can be employed to mitigate the problem. 
And while dealing with
+dependency conflict in custom operators is difficult, it's actually quite a 
bit easier when it comes to
+Task-Flow approach or (equivalently) using ``PythonVirtualenvOperator`` or
+``PythonPreexistingVirtualenvOperator``.
+
+Let's start from the strategies that are easiest to implement (having some 
limits and overhead), and
+we will gradually go through those strategies that requires some changes in 
your Airflow deployment.
+
+Using PythonVirtualenvOperator
+------------------------------
+
+This is simplest to use and most limited strategy. The 
PythonVirtualenvOperator allows you to dynamically
+create a virtualenv that your Python callable function will execute in. In the 
modern
+TaskFlow approach described in :doc:`/tutorial_taskflow_api`. this also can be 
done with decorating
+your callable with ``@task.virtualenv`` decorator (recommended way of using 
the operator).
+Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
+have its own independent Python virtualenv and can specify fine-grained set of 
requirements that need
+to be installed for that task to execute.
+
+The operator takes care of:
+
+* creating the virtualenv based on your environment
+* serializing your Python callable and passing it to execution by the 
virtualenv Python interpreter
+* executing it and retrieving the result of the callable and pushing it via 
xcom if specified
+
+The benefits of the operator are:
+
+* There is no need to prepare the venv upfront. It will be dynamically created 
before task is run, and
+  removed after it is finished, so there is nothing special (except having 
virtualenv package in your
+  airflow dependencies) to make use of multiple virtual environments
+* You can run tasks with different sets of dependencies on the same workers - 
thus Memory resources are
+  reused (though see below about the CPU overhead involved in creating the 
venvs).
+* In bigger installations, DAG Authors do not need to ask anyone to create the 
venvs for you.
+  As a DAG Author, you only have to have virtualenv dependency installed and 
you can specify and modify the
+  environments as you see fit.
+* No changes in deployment requirements - whether you use Local virtualenv, or 
Docker, or Kubernetes,
+  the tasks will work without adding anything to your deployment.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python requirements
+  is required to author DAGs this way.
+
+There are certain limitations and overhead introduced by this operator:
+
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in the Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments.
+* The operator adds a CPU, networking and elapsed time overhead for running 
each task - Airflow has
+  to re-create the virtualenv from scratch for each task
+* The workers need to have access to PyPI or private repositories to install 
dependencies
+* The dynamic creation of virtualenv is prone to transient failures (for 
example when your repo is not available
+  or when there is a networking issue with reaching the repository)
+* It's easy to  fall into a "too" dynamic environment - since the dependencies 
you install might get upgraded
+  and their transitive dependencies might get independent upgrades you might 
end up with the situation where
+  your task will stop working because someone released a new version of a 
dependency or you might fall
+  a victim of "supply chain" attack where new version of a dependency might 
become malicious
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+
+Using PythonPreexistingVirtualenvOperator
+-----------------------------------------
+
+.. versionadded:: 2.4
+
+A bit more complex but with significantly less overhead, security, stability 
problems is to use the
+:class:`airflow.operators.python.PythonPreexistingVirtualenvOperator``, or 
even better - decorating your callable with
+``@task.preexisting_virtualenv`` decorator. It requires however that the 
virtualenv you use is immutable.
+The ``immutable`` in this context means that (unlike in 
:class:`airflow.operators.python.PythonVirtualenvOperator`)
+you cannot add new dependencies to such pre-existing virtualenv. All 
dependencies you need should be added
+upfront in your environment (and available in all the workers in case your 
Airflow runs in a distributed
+environment). This way you avoid the overhead and problems of re-creating the
+virtual environment but they have to be prepared and deployed together with 
Airflow installation, so usually people
+who manage Airflow installation need to be involved (and in bigger 
installations those are usually different
+people than DAG Authors (DevOps/System Admins).
+
+Those virtual environments can be prepared in various ways - if you use 
LocalExecutor they just need to be installed
+at the machine where scheduler is run, if you are using distributed Celery 
virtualenv installations, there
+should be a pipeline that installs those virtual environments across multiple 
machines, finally if you are using
+Docker Image (for example via Kubernetes), the virtualenv creation should be 
added to the pipeline of
+your custom image building.
+
+The benefits of the operator are:
+
+* No setup overhead when running the task. The virtualenv is ready when you 
start running a task.
+* You can run tasks with different sets of dependencies on the same workers - 
thus all resources are reused.
+* There is no need to have access by workers to PyPI or private repositories. 
Less chance for transient
+  errors resulting from networking.
+* The dependencies can be pre-vetted by the admins and your security team, no 
unexpected, new code will
+  be added dynamically. This is good for both, security and stability.
+* Limited impact on your deployment - you do not need to switch to Docker 
containers or Kubernetes to
+  make a good use of the operator.
+* No need to learn more about containers, Kubernetes as a DAG Author. Only 
knowledge of Python, requirements
+  is required to author DAGs this way.
+
+The drawbacks:
+
+* Your environment needs to have the virtual environments prepared upfront. 
This usually means that you
+  cannot change it on the fly, adding new or changing requirements require at 
least an Airflow re-deployment
+  and iteration time when you work on new versions might be longer.
+* Your python callable has to be serializable. There are a number of python 
objects that are not serializable
+  using standard ``pickle`` library. You can mitigate some of those 
limitations by using ``dill`` library
+  but even that library does not solve all the serialization limitations.
+* All dependencies that are not available in Airflow environment must be 
locally imported in the callable you
+  use and the top-level Python code of your DAG should not import/use those 
libraries.
+* The virtual environments are run in the same operating system, so they 
cannot have conflicting system-level
+  dependencies (``apt`` or ``yum`` installable packages). Only Python 
dependencies can be independently
+  installed in those environments
+* The tasks are only isolated from each other via running in different 
environments. This makes it possible
+  that running tasks will still interfere with each other - for example 
subsequent tasks executed on the
+  same worker might be affected by previous tasks creating/modifying files et.c
+
+Actually, you can think about the ``PythonVirtualenvOperator`` and 
``PythonPreexistingVirtualenvOperator``
+as counterparts - as a DAG author you'd normally iterate with dependencies and 
develop your DAG using
+``PythonVirtualenvOperator`` (thus decorating your tasks with 
``@task.virtualenv`` decorators) while
+after the iteration and changes you would likely want to change it for 
production to switch to
+the ``PythonPreexistingVirtualenvOperator`` after your DevOps/System Admin 
teams deploy your new
+virtualenv to production. The nice thing about this is that you can switch the 
decorator back
+at any time and continue developing it "dynamically" with 
``PythonVirtualenvOperator``.
+
+
+Using DockerOperator or Kubernetes Pod Operator
+-----------------------------------------------
+
+Another strategy is to use the Docker Operator or the Kubernetes Pod Operator. 
Those require that Airflow runs in a
+Docker container environment or Kubernetes environment (or at the very least 
have access to create and
+run tasks with those).
+
+Similarly as in case of Python operators, the taskflow decorators are handy 
for you if you would like to
+use those operators to execute your callable Python code.
+
+However, it is far more involved - you need to understand how 
Docker/Kubernetes Pods work if you want to use
+this approach, but the tasks are fully isolated from each other and you are 
not even limited to running
+Python code. You can write your tasks in any Programming language you want. 
Also your dependencies are
+fully independent from Airflow ones (including the system level dependencies) 
so if your task require
+a very different environment, this is the way to go. Those are 
``@task.docker`` and ``@task.kubernetes``
+decorators.
+
+The benefits of those operators are:
+
+* You can run tasks with different sets of both Python and system level 
dependencies, or even tasks
+  written in completely different language or even different processor 
architecture (x86 vs. arm).
+* The environment used to run the tasks enjoys the optimizations and 
immutability of containers, where a
+  similar set of dependencies can effectively reuse a number of cached layers 
of the image, so the
+  environment is optimized for the case where you have multiple similar, but 
different environments.
+* The dependencies can be pre-vetted by the admins and your security team, no 
unexpected, new code will
+  be added dynamically. This is good for both, security and stability.

Review Comment:
   ```suggestion
     be added at task runtime. This is good for both security and stability.
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [airflow] ashb commented on a diff in pull request #25780: Implement PythonPreexistingVirtualenvOperator

Reply via email to