Re: [DISCUSS] "Use existing venv" support for PythonVirtualenvOperator as counterpart to AIP-46

Jarek Potiuk Fri, 12 Aug 2022 06:01:41 -0700

* it has finer granularity level than AIP-46 - unlike in AIP-46 you could
use different sets of dependencies *per task not per dag *(I did not finish
that sentence)



On Fri, Aug 12, 2022 at 2:58 PM Jarek Potiuk <ja...@potiuk.com> wrote:

> Hello everyone,
>
> *TL;DR;* I propose to extend our PythonVirtualenvOperator with "use
> existing venv" feature and make it a viable way of handling some
> multi-dependency sets using multiple pre-installed venvs.
>
> *More context:*
>
> I had this idea coming after a discussion in our Slack:
> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179
>
> My thoughts were - why don't we add support for "use existing venv" in
> PythonVirtualenvOperator as first-class-citizen ?
>
> Currently (unless there are some tricks I am not aware of) or extend PVO,
> the PVO will always attempt to create a virtualenv based on extra
> requirements. And while it gives the users a possibility of having some
> tasks use different dependencies, the drawback is that the venv is created
> dynamically when tasks starts - potentially a lot of overhead for  startup
> time and some unpleasant failure scenarios - like networking problems, PyPI
> or local repoi not available, automated (and unnoticed) upgrade of
> dependencies.
>
> Those are basically the same problems that caused us to strongly
> discourage our users in our Helm Chart to use _PIP_ADDITIONAL_DEPENDENCIES
> in production and criticize the  Community Helm Chart for dynamic
> dependency installation they promote as a "valid" approach. Yet our PVO
> currently does exactly this.
>
> We had some past discussions how this can be improved - with caching, or
> using different images for different dependencies and similar - and even we
> have
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing
> proposal to use different images for different sets of requirements.
>
> *Proposal:*
>
> During the discussion yesterday I started to think a simpler solution is
> possible and rather simple to implement by us and for users to use.
>
> Why not have different venvs preinstalled and let the PVO choose the one
> that should be used?
>
> It does not invalidate AIP-46. AIP-46 serves a bit different purpose and
> some cases cannot be handled this way - when you need different "system
> level" dependencies for example) but it might be much simpler from
> deployment point of view and allow it to handle "multi-dependency sets" for
> Python libraries only with minimal deployment overhead (which AIP-46
> necessarily has). And I think it will be enough for a vast number of the
> "multi-dependency-sets" cases.
>
> Why don't we allow the users to prepare those venvs upfront and simply
> enable PVE to use them rather than create them dynamically ?
>
> *Advantages:*
>
> * it nicely handles cases where some of your tasks need a different set of
> dependencies than others (for execution, not necessarily parsing at least
> initially).
>
> * no startup time overhead needed as with current PVO
>
> * possible to run in both cases - "venv installation" and "docker image"
> installation
>
> * it has finer granularity level than AIP-46 - unlike in AIP-46 you could
> use different sets of dependencies
>
> * very easy to pull off for the users without modifying their
> deployments,For local venv, you just create the venvs, For Docker image
> case, your custom image needs to add several lines similar to:
>
> RUN python -m venv --system-site-packages PACKAGE1==NN PACKAGE2==NN
> /opt/venv1
> RUN python -m venv --system-site-packages PACKAGE1==NN PACKAGE2==NN
> /opt/venv2
>
> and PythonVenvOperator should have extra "use_existing_venv=/opt/venv2")
> parameter
>
> * we only need to manage ONE image (!) even if you have multiple sets of
> dependencies (this has the advantage that it is actually LOWER overhead
> than having separate images for each env -when it comes to
> various resources overhead (same workers could handle multiple dependency
> sets for examples, same image is reused by multiple PODs in K8S etc. ).
>
> * later (when AIP-43 (separate dag processor with ability to use different
> processors for different subdirectories) is completed and AIP-46 is
> approved/implemented, we could also extend DAG Parsing to be able to use
> those predefined venvs for parsing. That would eliminate the need for local
> imports and add support to even use different sets of libraries in
> top-level code (per DAG, not per task). It would not solve different
> "system" level dependencies - and for that AiP-46 is still a very valid
> case.
>
> *Disadvantages:*
>
> I thought very hard about this one and I actually could not find any
> disadvantages :)
>
> It's simple to implement, use and explain, it can be implemented very
> quickly (like - in a few hours with tests and documentation I think) and
> performance-wise it is better for any other solution (including AIP-46)
> providing that the case is limited to different Python dependencies.
>
> But possibly there are things that I missed. It all looks too good to be
> true, and I wonder why we do not have it already today - once I thought
> about it, it seems very obvious. So I probably missed something.
>
> WDYT?
>
> J.
>
>
>
>
>
>
>
>

Re: [DISCUSS] "Use existing venv" support for PythonVirtualenvOperator as counterpart to AIP-46

Reply via email to