potiuk commented on code in PR #25780:
URL: https://github.com/apache/airflow/pull/25780#discussion_r951853331
##########
docs/apache-airflow/best-practices.rst:
##########
@@ -619,3 +621,219 @@ Prune data before upgrading
---------------------------
Some database migrations can be time-consuming. If your metadata database is very large, consider pruning some of
the old data with the :ref:`db clean<cli-db-clean>` command prior to performing the upgrade. *Use with caution.*


Handling Python dependencies
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Airflow has many Python dependencies, and sometimes those dependencies conflict with the dependencies that your
task code expects. Since, by default, Airflow is a single Python environment with a single set of Python
dependencies, there are often also cases where some of your tasks require different dependencies than other tasks,
and those dependencies conflict with each other.

If you are using pre-defined Airflow operators to talk to external services, there is not much choice, but usually
those operators have dependencies that do not conflict with the basic Airflow dependencies. Airflow uses a
constraints mechanism, which means that you have a "fixed" set of dependencies that the community guarantees
Airflow can be installed with (including all community providers) without triggering conflicts. However, you can
upgrade the providers independently, and the constraints do not limit you there, so the chance of a conflicting
dependency is lower (you still have to test those dependencies). Therefore, when you are using pre-defined
operators, the chances are that you will have little to no problems with conflicting dependencies.

However, when you approach Airflow in a more "modern" way, where you use the TaskFlow API and most of
your operators are written using custom Python code, or when you want to write your own custom operator,
you might get to the point where the dependencies required by your custom code conflict with those
of Airflow, or even that the dependencies of several of your custom operators conflict with each other.

There are a number of strategies that can be employed to mitigate the problem. And while dealing with
dependency conflicts in custom operators is difficult, it is actually quite a bit easier with the
TaskFlow approach or (equivalently) with ``PythonVirtualenvOperator`` or
``PreexistingPythonVirtualenvOperator``.

Let's start with the strategies that are easiest to implement (but have some limits and overhead), and
we will gradually go through those strategies that require some changes in your Airflow deployment.

Using PythonVirtualenvOperator
------------------------------

This is the simplest to use and the most limited strategy. The ``PythonVirtualenvOperator`` allows you to dynamically
create a virtualenv that your Python callable function will execute in. In the modern
TaskFlow approach described in :doc:`/tutorial_taskflow_api`, this can also be done by decorating
your callable with the ``@task.virtualenv`` decorator (the recommended way of using the operator).
Each :class:`airflow.operators.python.PythonVirtualenvOperator` task can
have its own independent Python virtualenv and can specify a fine-grained set of requirements that need
to be installed for that task to execute.
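For illustration, a minimal sketch of such a task might look like the one below (the function name, the
``amount`` column and the pinned ``pandas`` version are purely illustrative, and the example assumes the
``virtualenv`` package is available in the Airflow environment):

.. code-block:: python

    from airflow.decorators import task


    @task.virtualenv(
        requirements=["pandas==1.4.3"],  # illustrative pin - it may conflict with the main Airflow env
        system_site_packages=False,  # keep the venv isolated from Airflow's own dependencies
    )
    def summarize_sales(raw_rows):
        # Import the extra dependency *inside* the callable, so the top-level
        # DAG code never needs it installed.
        import pandas as pd

        return int(pd.DataFrame(raw_rows)["amount"].sum())

Inside a DAG, the decorated ``summarize_sales`` is then called like any other TaskFlow task, and its return
value is passed back to the main Airflow process and pushed to XCom.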
The operator takes care of:

* creating the virtualenv based on your environment
* serializing your Python callable and passing it to the virtualenv Python interpreter for execution
* executing it, retrieving the result of the callable and pushing it via XCom if specified

The benefits of the operator are:

* There is no need to prepare the venv upfront. It will be dynamically created before the task is run and
  removed after it is finished, so there is nothing special to do (except having the ``virtualenv`` package in your
  Airflow dependencies) to make use of multiple virtual environments.
* You can run tasks with different sets of dependencies on the same workers - thus memory resources are
  reused (though see below about the CPU overhead involved in creating the venvs).
* In bigger installations, DAG authors do not need to ask anyone to create the venvs for them.
  As a DAG author, you only have to have the ``virtualenv`` dependency installed and you can specify and modify the
  environments as you see fit.
* No changes in deployment requirements - whether you use a local virtualenv, Docker, or Kubernetes,
  the tasks will work without adding anything to your deployment.
* No need to learn more about containers or Kubernetes as a DAG author. Only knowledge of Python and
  requirements is needed to author DAGs this way.

There are certain limitations and overhead introduced by the operator:

* Your Python callable has to be serializable. There are a number of Python objects that are not serializable
  using the standard ``pickle`` library. You can mitigate some of those limitations by using the ``dill`` library
  (see the sketch after this list), but even that library does not solve all the serialization limitations.
* All dependencies that are not available in the Airflow environment must be imported locally in the callable you
  use, and the top-level Python code of your DAG should not import/use those libraries.
* The virtual environments run in the same operating system, so they cannot have conflicting system-level
  dependencies (``apt`` or ``yum`` installable packages). Only Python dependencies can be independently
  installed in those environments.
* The operator adds CPU, networking and elapsed-time overhead for running each task - Airflow has
  to re-create the virtualenv from scratch for each task.
* The workers need to have access to PyPI or private repositories to install dependencies.
* The dynamic creation of the virtualenv is prone to transient failures (for example when your repo is not available
  or when there is a networking issue with reaching the repository).
* It's easy to fall into a "too dynamic" environment - since the dependencies you install might get upgraded
  and their transitive dependencies might get independent upgrades, you might end up in a situation where
  your task stops working because someone released a new version of a dependency, or you might fall
  victim to a "supply chain" attack where a new version of a dependency becomes malicious.
* The tasks are only isolated from each other via running in different environments. This makes it possible
  that running tasks will still interfere with each other - for example, subsequent tasks executed on the
  same worker might be affected by previous tasks creating/modifying files etc.
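If you prefer the classic operator over the decorator, the same idea can be sketched roughly as follows (the DAG id,
task id, callable and the ``numpy`` pin are hypothetical; ``use_dill=True`` is shown only as the optional mitigation
for the pickling limits mentioned above):

.. code-block:: python

    import datetime

    from airflow import DAG
    from airflow.operators.python import PythonVirtualenvOperator


    def _score_rows():
        # The non-Airflow dependency is imported locally, inside the callable.
        import numpy as np

        return float(np.random.default_rng(42).random())


    with DAG(
        dag_id="example_virtualenv_classic",  # hypothetical DAG id
        start_date=datetime.datetime(2022, 8, 1),
        schedule_interval=None,
        catchup=False,
    ):
        PythonVirtualenvOperator(
            task_id="score_rows",
            python_callable=_score_rows,
            requirements=["numpy>=1.22"],  # illustrative pin
            use_dill=True,  # serialize the callable with dill instead of plain pickle
            system_site_packages=False,
        )

Every run of this task pays the cost of creating the virtualenv and installing ``numpy`` from scratch, which is
exactly the overhead described in the list above.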
Using PreexistingPythonVirtualenvOperator
-----------------------------------------

.. versionadded:: 2.4

A bit more complex, but with significantly less overhead and fewer security and stability problems, is to use the
:class:`airflow.operators.python.PreexistingPythonVirtualenvOperator` or, even better, to decorate your callable with
the ``@task.preexisting_virtualenv`` decorator. It requires, however, that the virtualenv you use is immutable
by the task and prepared upfront in your environment (and available in all the workers in case your

Review Comment:
   This is mostly the result of an earlier question on the devlist: "should we allow the user to also add extra requirements, similar to those in PythonVirtualenvOperator?". My answer is "no". The env should be immutable - if we allow mixing "preexisting" and "dynamic" virtualenvs, this opens up a host of edge cases. This is why "immutable" is mentioned. We might want to add more info about it if you think it is unclear.
