I feel like this is a great alternative at the price of a very moderate
effort. (I'd be glad to help with it).

Mutually exclusive sounds good to me as well.

Best,
Pierre

Le ven. 12 août 2022 à 15:23, Jarek Potiuk <ja...@potiuk.com> a écrit :

> Mutually exclusive. I think that has the nice property of forcing people
> to prepare immutable venvs upfront.
>
> On Fri, Aug 12, 2022 at 3:15 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
>> Yes, this has been on my background idea list for an age -- I'd love to
>> see it happen!
>>
>> Have you thought about how it would behave when you specify an existing
>> virtualenv and include requirements in the operator that are not already
>> installed there? Or would they be mutually exclusive? (I don't mind either
>> way, just wondering which way you are heading)
>>
>> -ash
>>
>> On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk <ja...@potiuk.com>
>> wrote:
>>
>> Hello everyone,
>>
>> *TL;DR;* I propose to extend our PythonVirtualenvOperator with "use
>> existing venv" feature and make it a viable way of handling some
>> multi-dependency sets using multiple pre-installed venvs.
>>
>> *More context:*
>>
>> I had this idea coming after a discussion in our Slack:
>> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179
>>
>> My thoughts were - why don't we add support for "use existing venv" in
>> PythonVirtualenvOperator as first-class-citizen ?
>>
>> Currently (unless there are some tricks I am not aware of) or extend PVO,
>> the PVO will always attempt to create a virtualenv based on extra
>> requirements. And while it gives the users a possibility of having some
>> tasks use different dependencies, the drawback is that the venv is created
>> dynamically when tasks starts - potentially a lot of overhead for  startup
>> time and some unpleasant failure scenarios - like networking problems, PyPI
>> or local repoi not available, automated (and unnoticed) upgrade of
>> dependencies.
>>
>> Those are basically the same problems that caused us to strongly
>> discourage our users in our Helm Chart to use _PIP_ADDITIONAL_DEPENDENCIES
>> in production and criticize the  Community Helm Chart for dynamic
>> dependency installation they promote as a "valid" approach. Yet our PVO
>> currently does exactly this.
>>
>> We had some past discussions how this can be improved - with caching, or
>> using different images for different dependencies and similar - and even we
>> have
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing
>> proposal to use different images for different sets of requirements.
>>
>> *Proposal:*
>>
>> During the discussion yesterday I started to think a simpler solution is
>> possible and rather simple to implement by us and for users to use.
>>
>> Why not have different venvs preinstalled and let the PVO choose the one
>> that should be used?
>>
>> It does not invalidate AIP-46. AIP-46 serves a bit different purpose and
>> some cases cannot be handled this way - when you need different "system
>> level" dependencies for example) but it might be much simpler from
>> deployment point of view and allow it to handle "multi-dependency sets" for
>> Python libraries only with minimal deployment overhead (which AIP-46
>> necessarily has). And I think it will be enough for a vast number of the
>> "multi-dependency-sets" cases.
>>
>> Why don't we allow the users to prepare those venvs upfront and simply
>> enable PVE to use them rather than create them dynamically ?
>>
>> *Advantages:*
>>
>> * it nicely handles cases where some of your tasks need a different set
>> of dependencies than others (for execution, not necessarily parsing at
>> least initially).
>>
>> * no startup time overhead needed as with current PVO
>>
>> * possible to run in both cases - "venv installation" and "docker image"
>> installation
>>
>> * it has finer granularity level than AIP-46 - unlike in AIP-46 you could
>> use different sets of dependencies
>>
>> * very easy to pull off for the users without modifying their
>> deployments,For local venv, you just create the venvs, For Docker image
>> case, your custom image needs to add several lines similar to:
>>
>> RUN python -m venv --system-site-packages PACKAGE1==NN PACKAGE2==NN
>> /opt/venv1
>> RUN python -m venv --system-site-packages PACKAGE1==NN PACKAGE2==NN
>> /opt/venv2
>>
>> and PythonVenvOperator should have extra "use_existing_venv=/opt/venv2")
>> parameter
>>
>> * we only need to manage ONE image (!) even if you have multiple sets of
>> dependencies (this has the advantage that it is actually LOWER overhead
>> than having separate images for each env -when it comes to
>> various resources overhead (same workers could handle multiple dependency
>> sets for examples, same image is reused by multiple PODs in K8S etc. ).
>>
>> * later (when AIP-43 (separate dag processor with ability to use
>> different processors for different subdirectories) is completed and AIP-46
>> is approved/implemented, we could also extend DAG Parsing to be able to use
>> those predefined venvs for parsing. That would eliminate the need for local
>> imports and add support to even use different sets of libraries in
>> top-level code (per DAG, not per task). It would not solve different
>> "system" level dependencies - and for that AiP-46 is still a very valid
>> case.
>>
>> *Disadvantages:*
>>
>> I thought very hard about this one and I actually could not find any
>> disadvantages :)
>>
>> It's simple to implement, use and explain, it can be implemented very
>> quickly (like - in a few hours with tests and documentation I think) and
>> performance-wise it is better for any other solution (including AIP-46)
>> providing that the case is limited to different Python dependencies.
>>
>> But possibly there are things that I missed. It all looks too good to be
>> true, and I wonder why we do not have it already today - once I thought
>> about it, it seems very obvious. So I probably missed something.
>>
>> WDYT?
>>
>> J.
>>
>>
>>
>>
>>
>>
>>
>>

Reply via email to