One thing I thought of (but never got around to writing up) is to introduce a separate operator instead, say ExternalPythonOperator (bikeshedding on the name is welcome), that explicitly takes a path to an interpreter (say, one in a virtual environment) and uses it to run the code. This also lets users create a virtual environment upfront, while avoiding the need to overload PythonVirtualenvOperator for the purpose. It also opens up an extra use case: you can use any Python installation to run the code (say, a custom-compiled interpreter), although nobody has asked for that.
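The core mechanism such an operator needs is just "run this code with that interpreter". A minimal, stdlib-only sketch of that mechanism (the function name and signature here are hypothetical, not anything in Airflow):

```python
import subprocess
import sys


def run_with_interpreter(python_path: str, source: str) -> str:
    """Execute Python source with an arbitrary interpreter binary
    (e.g. /opt/venv1/bin/python from a pre-built virtualenv, or a
    custom-compiled python) and return its stdout.

    Raises subprocess.CalledProcessError if the snippet fails.
    """
    result = subprocess.run(
        [python_path, "-c", source],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


# Demo: use the current interpreter as a stand-in for a venv's python;
# an ExternalPythonOperator would receive this path as a parameter.
print(run_with_interpreter(sys.executable, "print(6 * 7)").strip())  # → 42
```

A real operator would of course also have to serialize the callable and its arguments across the process boundary (as PythonVirtualenvOperator does today), but the interpreter-selection part is this simple.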
TP

> On 13 Aug 2022, at 02:52, Jeambrun Pierre <pierrejb...@gmail.com> wrote:
>
> I feel like this is a great alternative at the price of a very moderate
> effort. (I'd be glad to help with it.)
>
> Mutually exclusive sounds good to me as well.
>
> Best,
> Pierre
>
> On Fri, 12 Aug 2022 at 15:23, Jarek Potiuk <ja...@potiuk.com> wrote:
>
> Mutually exclusive. I think that has the nice property of forcing people to
> prepare immutable venvs upfront.
>
> On Fri, Aug 12, 2022 at 3:15 PM Ash Berlin-Taylor <a...@apache.org> wrote:
>
> Yes, this has been on my background idea list for an age -- I'd love to see
> it happen!
>
> Have you thought about how it would behave when you specify an existing
> virtualenv and include requirements in the operator that are not already
> installed there? Or would they be mutually exclusive? (I don't mind either
> way, just wondering which way you are heading.)
>
> -ash
>
> On Fri, Aug 12 2022 at 14:58:44 +02:00:00, Jarek Potiuk <ja...@potiuk.com> wrote:
>>
>> Hello everyone,
>>
>> TL;DR: I propose to extend our PythonVirtualenvOperator with a "use
>> existing venv" feature and make it a viable way of handling multiple
>> dependency sets using several pre-installed venvs.
>>
>> More context:
>>
>> I had this idea after a discussion in our Slack:
>> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179
>>
>> My thought was: why don't we add support for "use existing venv" in
>> PythonVirtualenvOperator as a first-class citizen?
>>
>> Currently (unless there are some tricks I am not aware of), the PVO will
>> always attempt to create a virtualenv based on the extra requirements.
>> And while this gives users the possibility of having some tasks use
>> different dependencies, the drawback is that the venv is created
>> dynamically when the task starts - potentially a lot of startup-time
>> overhead and some unpleasant failure scenarios: networking problems, PyPI
>> or a local repo being unavailable, automated (and unnoticed) upgrades of
>> dependencies.
>>
>> These are basically the same problems that caused us to strongly
>> discourage our Helm Chart users from using _PIP_ADDITIONAL_REQUIREMENTS in
>> production, and to criticize the Community Helm Chart for the dynamic
>> dependency installation it promotes as a "valid" approach. Yet our PVO
>> currently does exactly this.
>>
>> We have had some past discussions about how this could be improved - with
>> caching, using different images for different dependencies, and the like -
>> and we even have
>> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing
>> proposing to use different images for different sets of requirements.
>>
>> Proposal:
>>
>> During yesterday's discussion I started to think a simpler solution is
>> possible - one that is easy for us to implement and for users to adopt.
>>
>> Why not have different venvs preinstalled and let the PVO choose the one
>> that should be used?
>>
>> This does not invalidate AIP-46. AIP-46 serves a somewhat different
>> purpose, and some cases cannot be handled this way (for example when you
>> need different "system level" dependencies), but this approach might be
>> much simpler from a deployment point of view and allow handling
>> "multi-dependency sets" for Python libraries only, with minimal deployment
>> overhead (which AIP-46 necessarily has). And I think it will be enough for
>> a vast number of the "multi-dependency-set" cases.
>>
>> Why don't we allow users to prepare those venvs upfront and simply enable
>> the PVO to use them rather than create them dynamically?
>>
>> Advantages:
>>
>> * It nicely handles cases where some of your tasks need a different set of
>> dependencies than others (for execution; not necessarily parsing, at least
>> initially).
>>
>> * No startup-time overhead, unlike the current PVO.
>>
>> * It works in both installation modes - "venv installation" and "Docker
>> image" installation.
>>
>> * It has a finer granularity level than AIP-46 - unlike AIP-46, you could
>> switch dependency sets at the task level rather than the image level.
>>
>> * It is very easy for users to pull off without modifying their
>> deployments. For a local venv installation, you just create the venvs. For
>> the Docker image case, your custom image needs a few lines similar to:
>>
>> RUN python -m venv --system-site-packages /opt/venv1 && \
>>     /opt/venv1/bin/pip install PACKAGE1==NN PACKAGE2==NN
>> RUN python -m venv --system-site-packages /opt/venv2 && \
>>     /opt/venv2/bin/pip install PACKAGE3==NN PACKAGE4==NN
>>
>> and PythonVirtualenvOperator would get an extra parameter, e.g.
>> use_existing_venv="/opt/venv2".
>>
>> * We only need to manage ONE image (!) even with multiple sets of
>> dependencies. This is actually LOWER overhead than having a separate image
>> per environment: the same workers can handle multiple dependency sets, the
>> same image is reused by multiple pods in K8s, etc.
>>
>> * Later, when AIP-43 (separate DAG processor with the ability to use
>> different processors for different subdirectories) is completed and AIP-46
>> is approved/implemented, we could also extend DAG parsing to use those
>> predefined venvs. That would eliminate the need for local imports and even
>> allow different sets of libraries in top-level code (per DAG, not per
>> task). It would not solve different "system level" dependencies - for
>> those, AIP-46 remains a very valid case.
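For the local (non-Docker) case mentioned above, preparing such a venv upfront is just a few commands. A sketch with an illustrative path (in the Docker image case this would be e.g. /opt/venv1, and the pip pins would be your real packages):

```shell
# Illustrative location for the pre-built venv.
VENV_DIR=/tmp/airflow-demo-venv

# Create the venv once, upfront, at deployment/image-build time.
python3 -m venv --system-site-packages "$VENV_DIR"

# At this point you would pin exact versions, e.g.:
#   "$VENV_DIR/bin/pip" install "PACKAGE1==NN" "PACKAGE2==NN"
# so the environment stays immutable and reproducible.

# This is the interpreter a "use_existing_venv" parameter would point at;
# sys.prefix confirms we are inside the venv, not the system Python.
"$VENV_DIR/bin/python" -c 'import sys; print(sys.prefix)'
```

Because the venv is created with --system-site-packages, the task code still sees the system-wide Airflow installation while adding its own pinned dependencies on top.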
>>
>> Disadvantages:
>>
>> I thought very hard about this one and I actually could not find any
>> disadvantages :)
>>
>> It is simple to implement, use, and explain; it can be implemented very
>> quickly (in a few hours, including tests and documentation, I think); and
>> performance-wise it is better than any other solution (including AIP-46),
>> provided the case is limited to different Python dependencies.
>>
>> But possibly there are things I missed. It all looks too good to be true,
>> and I wonder why we do not have it already - once I thought about it, it
>> seemed very obvious. So I probably missed something.
>>
>> WDYT?
>>
>> J.