Another option would be to change the PythonOperator/@task to take a `python` argument (which also does change the behaviour of _that_ operator a lot with or without that argument if we did that.)
On 17 August 2022 15:46:52 BST, Jarek Potiuk <ja...@potiuk.com> wrote: >Yeah. TP - I like that explicit separation. It's much cleaner. I still >have to think about the name though. While I see where >ExternalPythonOperator comes from, It sounds a bit less than obvious. >I think the name should somehow contain "Environment" because very few >people realise that running Python from a virtualenv actually >implicitly "activates" the venv. >I think maybe deprecating the old PythonVirtualenvOperator and >introducing two new operators: PythonInCreatedVirtualEnvOperator, >PythonInExistingVirtualEnvOperator ? Not exactly those names - they >are too long - but something like that. Maybe we should get rid of >Python in the name at all ? > >BTW. I think we should generally do more of the discussions here and >express our thoughts about Airflow here. Even if there are no answers >or interest immediately, I think that it makes sense to do a bit of a >melting pot that sometimes might produce some cool (or rather hot) >stuff as a result. > >On Wed, Aug 17, 2022 at 8:45 AM Tzu-ping Chung <t...@astronomer.io.invalid> >wrote: >> >> One thing I thought of (but never bothered to write about) is to introduce a >> separate operator instead, say ExternalPythonOperator (bike shedding on name >> is welcomed), that explicitly takes a path to the interpreter (say in a >> virtual environment) and just use that to run the code. This also enables >> users to create a virtual environment upfront, but avoids needing to >> overload PythonVirtualenvOperator for the purpose. This also opens an extra >> use case that you can use any Python installation to run the code (say a >> custom-compiled interpreter), although nobody asked about that. >> >> TP >> >> >> On 13 Aug 2022, at 02:52, Jeambrun Pierre <pierrejb...@gmail.com> wrote: >> >> I feel like this is a great alternative at the price of a very moderate >> effort. (I'd be glad to help with it). >> >> Mutually exclusive sounds good to me as well. >> >> Best, >> Pierre >> >> Le ven. 12 août 2022 à 15:23, Jarek Potiuk <ja...@potiuk.com> a écrit : >>> >>> Mutually exclusive. I think that has the nice property of forcing people to >>> prepare immutable venvs upfront. >>> >>> On Fri, Aug 12, 2022 at 3:15 PM Ash Berlin-Taylor <a...@apache.org> wrote: >>>> >>>> Yes, this has been on my background idea list for an age -- I'd love to >>>> see it happen! >>>> >>>> Have you thought about how it would behave when you specify an existing >>>> virtualenv and include requirements in the operator that are not already >>>> installed there? Or would they be mutually exclusive? (I don't mind either >>>> way, just wondering which way you are heading) >>>> >>>> -ash >>>> >>>> On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk <ja...@potiuk.com> >>>> wrote: >>>> >>>> Hello everyone, >>>> >>>> TL;DR; I propose to extend our PythonVirtualenvOperator with "use existing >>>> venv" feature and make it a viable way of handling some multi-dependency >>>> sets using multiple pre-installed venvs. >>>> >>>> More context: >>>> >>>> I had this idea coming after a discussion in our Slack: >>>> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179 >>>> >>>> My thoughts were - why don't we add support for "use existing venv" in >>>> PythonVirtualenvOperator as first-class-citizen ? >>>> >>>> Currently (unless there are some tricks I am not aware of) or extend PVO, >>>> the PVO will always attempt to create a virtualenv based on extra >>>> requirements. And while it gives the users a possibility of having some >>>> tasks use different dependencies, the drawback is that the venv is created >>>> dynamically when tasks starts - potentially a lot of overhead for startup >>>> time and some unpleasant failure scenarios - like networking problems, >>>> PyPI or local repoi not available, automated (and unnoticed) upgrade of >>>> dependencies. >>>> >>>> Those are basically the same problems that caused us to strongly >>>> discourage our users in our Helm Chart to use _PIP_ADDITIONAL_DEPENDENCIES >>>> in production and criticize the Community Helm Chart for dynamic >>>> dependency installation they promote as a "valid" approach. Yet our PVO >>>> currently does exactly this. >>>> >>>> We had some past discussions how this can be improved - with caching, or >>>> using different images for different dependencies and similar - and even >>>> we have >>>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing >>>> proposal to use different images for different sets of requirements. >>>> >>>> Proposal: >>>> >>>> During the discussion yesterday I started to think a simpler solution is >>>> possible and rather simple to implement by us and for users to use. >>>> >>>> Why not have different venvs preinstalled and let the PVO choose the one >>>> that should be used? >>>> >>>> It does not invalidate AIP-46. AIP-46 serves a bit different purpose and >>>> some cases cannot be handled this way - when you need different "system >>>> level" dependencies for example) but it might be much simpler from >>>> deployment point of view and allow it to handle "multi-dependency sets" >>>> for Python libraries only with minimal deployment overhead (which AIP-46 >>>> necessarily has). And I think it will be enough for a vast number of the >>>> "multi-dependency-sets" cases. >>>> >>>> Why don't we allow the users to prepare those venvs upfront and simply >>>> enable PVE to use them rather than create them dynamically ? >>>> >>>> Advantages: >>>> >>>> * it nicely handles cases where some of your tasks need a different set of >>>> dependencies than others (for execution, not necessarily parsing at least >>>> initially). >>>> >>>> * no startup time overhead needed as with current PVO >>>> >>>> * possible to run in both cases - "venv installation" and "docker image" >>>> installation >>>> >>>> * it has finer granularity level than AIP-46 - unlike in AIP-46 you could >>>> use different sets of dependencies >>>> >>>> * very easy to pull off for the users without modifying their >>>> deployments,For local venv, you just create the venvs, For Docker image >>>> case, your custom image needs to add several lines similar to: >>>> >>>> RUN python -m venv --system-site-packages PACKAGE1==NN PACKAGE2==NN >>>> /opt/venv1 >>>> RUN python -m venv --system-site-packages PACKAGE1==NN PACKAGE2==NN >>>> /opt/venv2 >>>> >>>> and PythonVenvOperator should have extra "use_existing_venv=/opt/venv2") >>>> parameter >>>> >>>> * we only need to manage ONE image (!) even if you have multiple sets of >>>> dependencies (this has the advantage that it is actually LOWER overhead >>>> than having separate images for each env -when it comes to various >>>> resources overhead (same workers could handle multiple dependency sets for >>>> examples, same image is reused by multiple PODs in K8S etc. ). >>>> >>>> * later (when AIP-43 (separate dag processor with ability to use different >>>> processors for different subdirectories) is completed and AIP-46 is >>>> approved/implemented, we could also extend DAG Parsing to be able to use >>>> those predefined venvs for parsing. That would eliminate the need for >>>> local imports and add support to even use different sets of libraries in >>>> top-level code (per DAG, not per task). It would not solve different >>>> "system" level dependencies - and for that AiP-46 is still a very valid >>>> case. >>>> >>>> Disadvantages: >>>> >>>> I thought very hard about this one and I actually could not find any >>>> disadvantages :) >>>> >>>> It's simple to implement, use and explain, it can be implemented very >>>> quickly (like - in a few hours with tests and documentation I think) and >>>> performance-wise it is better for any other solution (including AIP-46) >>>> providing that the case is limited to different Python dependencies. >>>> >>>> But possibly there are things that I missed. It all looks too good to be >>>> true, and I wonder why we do not have it already today - once I thought >>>> about it, it seems very obvious. So I probably missed something. >>>> >>>> WDYT? >>>> >>>> J. >>>> >>>> >>>> >>>> >>>> >>>> >>>> >>