Yes, this has been on my background idea list for an age -- I'd love to
see it happen!
Have you thought about how it would behave when you specify an existing
virtualenv and include requirements in the operator that are not
already installed there? Or would they be mutually exclusive? (I don't
mind either way, just wondering which way you are heading)
-ash
On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk
<ja...@potiuk.com> wrote:
Hello everyone,
*TL;DR:* I propose to extend our PythonVirtualenvOperator with a "use
existing venv" feature and make it a viable way of handling some
multi-dependency sets using multiple pre-installed venvs.
*More context:*
I had this idea coming after a discussion in our Slack:
<https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179>
My thoughts were - why don't we add support for "use existing venv"
in PythonVirtualenvOperator as a first-class citizen?
Currently (unless there are some tricks I am not aware of, or you
extend PVO yourself), the PVO will always attempt to create a
virtualenv based on the extra requirements. And while it gives the
users a possibility of having some tasks use different dependencies,
the drawback is that the venv is created dynamically when the task
starts - potentially a lot of startup-time overhead and some
unpleasant failure scenarios - like networking problems, PyPI or a
local repo not being available, or automated (and unnoticed) upgrades
of dependencies.
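For illustration, this is roughly what the current behaviour looks
like from the DAG author's side (the package name and version are
placeholders I picked for the example). Every time the task runs, a
fresh venv is created and the requirements are pip-installed into it
before the callable executes:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def callable_in_venv():
    import pandas  # importable only inside the dynamically created venv
    print(pandas.__version__)

with DAG(dag_id="pvo_today", start_date=datetime(2022, 8, 1),
         schedule_interval=None):
    PythonVirtualenvOperator(
        task_id="needs_pandas",
        python_callable=callable_in_venv,
        requirements=["pandas==1.4.3"],  # pip-installed at task start, every run
        system_site_packages=True,
    )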
Those are basically the same problems that caused us to strongly
discourage our Helm Chart users from using
_PIP_ADDITIONAL_DEPENDENCIES in production, and to criticize the
Community Helm Chart for the dynamic dependency installation they
promote as a "valid" approach. Yet our PVO currently does exactly this.
We had some past discussions about how this could be improved - with
caching, using different images for different dependencies, and
similar - and we even have a proposal
<https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing>
to use different images for different sets of requirements.
*Proposal:*
During the discussion yesterday I started to think a simpler solution
is possible - one that is easy for us to implement and easy for users
to adopt. Why not have different venvs preinstalled and let the PVO
choose the one that should be used?
It does not invalidate AIP-46. AIP-46 serves a slightly different
purpose, and some cases cannot be handled this way (when you need
different "system level" dependencies, for example), but it might be
much simpler from a deployment point of view and would allow handling
"multi-dependency sets" for Python libraries only, with minimal
deployment overhead (which AIP-46 necessarily has). And I think it
will be enough for a vast number of the "multi-dependency-sets" cases.
Why don't we allow the users to prepare those venvs upfront and
simply enable PVO to use them rather than create them dynamically?
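To make the idea concrete, here is a minimal sketch of how the
operator could branch between the two modes. This is not actual PVO
code - the function name and structure are mine, just to illustrate
the proposal:

import os
import subprocess
import tempfile
import venv

def resolve_python(use_existing_venv=None, requirements=()):
    """Pick the interpreter that will run the task's callable."""
    if use_existing_venv:
        # Proposed path: reuse a pre-installed venv - nothing gets
        # installed at task runtime, so no network calls and no
        # startup overhead.
        python = os.path.join(use_existing_venv, "bin", "python")
        if not os.path.exists(python):
            raise FileNotFoundError(f"no interpreter found at {python}")
        return python
    # Current path: build a throwaway venv and pip-install the
    # requirements into it when the task starts.
    tmp_dir = tempfile.mkdtemp(prefix="pvo-venv-")
    venv.create(tmp_dir, with_pip=True, system_site_packages=True)
    python = os.path.join(tmp_dir, "bin", "python")
    subprocess.check_call([python, "-m", "pip", "install", *requirements])
    return python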
*Advantages:*
* it nicely handles cases where some of your tasks need a different
set of dependencies than others (for execution; not necessarily for
parsing, at least initially)
* no startup-time overhead, which the current PVO incurs
* works in both deployment models - "venv installation" and "docker
image" installation
* it has a finer granularity level than AIP-46 - unlike with AIP-46,
you can use different sets of dependencies without building a
separate image for each set
* very easy to pull off for the users without modifying their
deployments. For the local venv case, you just create the venvs; for
the Docker image case, your custom image needs a few lines similar to:

RUN python -m venv --system-site-packages /opt/venv1 && \
    /opt/venv1/bin/pip install PACKAGE1==NN PACKAGE2==NN
RUN python -m venv --system-site-packages /opt/venv2 && \
    /opt/venv2/bin/pip install PACKAGE3==NN PACKAGE4==NN

and PythonVirtualenvOperator should get an extra
"use_existing_venv=/opt/venv2" parameter (see the DAG sketch after
this list)
* we only need to manage ONE image (!) even if you have multiple sets
of dependencies. This is actually LOWER overhead than having a
separate image for each dependency set when it comes to various
resources: the same workers can handle multiple dependency sets, the
same image is reused by multiple PODs in K8S, etc.
* later, when AIP-43 (separate dag processor with the ability to use
different processors for different subdirectories) is completed and
AIP-46 is approved/implemented, we could also extend DAG parsing to
be able to use those predefined venvs. That would eliminate the need
for local imports and add support for using different sets of
libraries even in top-level code (per DAG, not per task). It would
not solve different "system" level dependencies - for that, AIP-46 is
still a very valid case.
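From the DAG author's perspective, using a predefined venv could then
look like this - note that "use_existing_venv" does not exist yet;
this is a hypothetical sketch of the proposed parameter:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonVirtualenvOperator

def callable_in_venv():
    import pandas  # expected to be pre-installed in /opt/venv2
    print(pandas.__version__)

with DAG(dag_id="pvo_existing_venv", start_date=datetime(2022, 8, 1),
         schedule_interval=None):
    PythonVirtualenvOperator(
        task_id="needs_pandas",
        python_callable=callable_in_venv,
        # Hypothetical parameter from this proposal - no venv creation
        # and no pip installation at task start:
        use_existing_venv="/opt/venv2",
    )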
*Disadvantages:*
I thought very hard about this one and I actually could not find any
disadvantages :)
It's simple to implement, use and explain; it can be implemented very
quickly (like - in a few hours, with tests and documentation, I
think) and performance-wise it is better than any other solution
(including AIP-46), provided that the case is limited to different
Python dependencies.
But possibly there are things that I missed. It all looks too good to
be true, and I wonder why we do not have it already today - once I
thought about it, it seemed very obvious. So I probably missed
something.
WDYT?
J.