Hello everyone,

*TL;DR:* I propose to extend our PythonVirtualenvOperator with a "use
existing venv" feature and make it a viable way of handling
multi-dependency sets using multiple pre-installed venvs.

*More context:*

I had this idea coming after a discussion in our Slack:
https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179

My thoughts were - why don't we add support for "use existing venv" in
PythonVirtualenvOperator as a first-class citizen?

Currently (unless there are some tricks I am not aware of, or you extend
the PVO), the PVO will always attempt to create a virtualenv based on the
extra requirements. And while this gives users the possibility of having
some tasks use different dependencies, the drawback is that the venv is
created dynamically when the task starts - potentially a lot of startup
time overhead and some unpleasant failure scenarios: networking problems,
PyPI or a local repo not being available, automated (and unnoticed)
upgrades of dependencies.

Those are basically the same problems that caused us to strongly
discourage our users from using _PIP_ADDITIONAL_DEPENDENCIES in production
with our Helm Chart, and to criticize the Community Helm Chart for the
dynamic dependency installation it promotes as a "valid" approach. Yet our
PVO currently does exactly this.

We had some past discussions on how this could be improved - with caching,
or using different images for different dependencies and similar - and we
even have the
https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing
proposal to use different images for different sets of requirements.

*Proposal:*

During the discussion yesterday I started to think that a simpler solution
is possible - one that is easy for us to implement and for users to adopt.

Why not have different venvs preinstalled and let the PVO choose the one
that should be used?

It does not invalidate AIP-46. AIP-46 serves a somewhat different purpose,
and some cases cannot be handled this way (when you need different "system
level" dependencies, for example), but this approach might be much simpler
from a deployment point of view and could handle "multi-dependency sets"
for Python libraries with minimal deployment overhead (which AIP-46
necessarily has). And I think it will be enough for a vast number of the
"multi-dependency-sets" cases.

Why don't we allow the users to prepare those venvs upfront and simply
enable the PVO to use them rather than create them dynamically?
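
To make it concrete, a task using a pre-built venv could look roughly like
this (a sketch only - the use_existing_venv parameter is exactly what is
being proposed here and does not exist yet):

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def heavy_callable():
    # runs inside the pre-installed venv, so these imports resolve there
    import pandas
    print(pandas.__version__)

with DAG("existing_venv_example", start_date=datetime(2022, 8, 1),
         schedule_interval=None) as dag:
    PythonVirtualenvOperator(
        task_id="heavy_task",
        python_callable=heavy_callable,
        # proposed parameter - points at a venv baked into the image
        use_existing_venv="/opt/venv1",
    )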

*Advantages:*

* it nicely handles cases where some of your tasks need a different set of
dependencies than others (for execution - not necessarily for parsing, at
least initially).

* none of the startup-time overhead the current PVO incurs

* works with both installation mechanisms - "local venv" installation and
"docker image" installation

* it has a finer granularity level than AIP-46 - unlike with AIP-46,
different tasks in the same image could use different sets of dependencies

* very easy to pull off for the users without modifying their deployments.
For the local venv case, you just create the venvs. For the Docker image
case, your custom image needs to add several lines similar to:

RUN python -m venv --system-site-packages /opt/venv1 && \
    /opt/venv1/bin/pip install PACKAGE1==NN PACKAGE2==NN
RUN python -m venv --system-site-packages /opt/venv2 && \
    /opt/venv2/bin/pip install PACKAGE3==NN PACKAGE4==NN

and the PythonVirtualenvOperator would get an extra
"use_existing_venv=/opt/venv2" parameter

* we only need to manage ONE image (!) even if you have multiple sets of
dependencies. This actually has LOWER overhead than having separate images
for each dependency set when it comes to various resources (the same
workers can handle multiple dependency sets, the same image is reused by
multiple Pods in K8S, etc.).

* later, when AIP-43 (separate DAG processor with the ability to use
different processors for different subdirectories) is completed and AIP-46
is approved/implemented, we could also extend DAG parsing to be able to use
those predefined venvs. That would eliminate the need for local imports and
add support for using different sets of libraries even in top-level code
(per DAG, not per task). It would not solve different "system" level
dependencies - for that, AIP-46 is still a very valid case.

*Disadvantages:*

I thought very hard about this one and I actually could not find any
disadvantages :)

It's simple to implement, use and explain, it can be implemented very
quickly (like - in a few hours with tests and documentation, I think) and
performance-wise it is better than any other solution (including AIP-46),
provided the case is limited to different Python dependencies.
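
For what it's worth, the core change is roughly this (an illustrative
sketch only - the helper name and internals are mine, not the actual PVO
code):

from pathlib import Path

def resolve_python_binary(use_existing_venv, create_venv_fn):
    """Pick the interpreter the task callable will run under."""
    if use_existing_venv:
        python = Path(use_existing_venv) / "bin" / "python"
        # fail fast rather than silently falling back to dynamic creation
        if not python.exists():
            raise RuntimeError(
                f"Pre-installed venv not found: {use_existing_venv}")
        return str(python)
    # current behaviour: build a throw-away venv from the requirements
    return create_venv_fn()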

But possibly there are things that I missed. It all looks too good to be
true, and I wonder why we do not have it already - once I thought about it,
it seems very obvious. So I probably missed something.

WDYT?

J.
