* it has a finer granularity level than AIP-46 - unlike in AIP-46, you could use different sets of dependencies *per task, not per DAG* (I did not finish that sentence)
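To make the "use existing venv" idea from the proposal quoted below a bit more concrete, here is a minimal sketch of what the operator could do internally: find the interpreter of a venv that already exists and run the task code with it in a subprocess, without creating anything and without calling pip. This is not the actual PVO implementation. The helper names (venv_python, run_in_existing_venv) are hypothetical, and it only uses the standard library.

```python
import os
import subprocess
from pathlib import Path


def venv_python(venv_dir: str) -> Path:
    """Return the path of the interpreter inside an already existing venv."""
    # Standard venv layout: bin/python on POSIX, Scripts\python.exe on Windows.
    if os.name == "nt":
        return Path(venv_dir) / "Scripts" / "python.exe"
    return Path(venv_dir) / "bin" / "python"


def run_in_existing_venv(venv_dir: str, source: str) -> str:
    """Run `source` with the venv's interpreter and return its stdout.

    Hypothetical helper: nothing is created or installed here, so a missing
    venv is a hard error instead of a trigger for dynamic installation.
    """
    python = venv_python(venv_dir)
    if not python.exists():
        raise FileNotFoundError(f"no interpreter found in venv: {python}")
    result = subprocess.run(
        [str(python), "-c", source],
        capture_output=True,
        text=True,
        check=True,  # a failing task surfaces as CalledProcessError
    )
    return result.stdout
```

With a venv baked into the image at /opt/venv2, as in the Docker example below, a call like run_in_existing_venv("/opt/venv2", "import sys; print(sys.prefix)") should report the venv's own prefix. Failing fast on a missing venv is deliberate here: it avoids the network and PyPI failure modes of creating the venv at task start.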
On Fri, Aug 12, 2022 at 2:58 PM Jarek Potiuk <ja...@potiuk.com> wrote:
> Hello everyone,
>
> *TL;DR:* I propose to extend our PythonVirtualenvOperator with a "use
> existing venv" feature and make it a viable way of handling some
> multi-dependency sets using multiple pre-installed venvs.
>
> *More context:*
>
> I had this idea after a discussion in our Slack:
> https://apache-airflow.slack.com/archives/CCV3FV9KL/p1660233834355179
>
> My thoughts were - why don't we add support for "use existing venv" in
> PythonVirtualenvOperator (PVO) as a first-class citizen?
>
> Currently, unless there are some tricks I am not aware of or you extend
> the PVO, the PVO will always attempt to create a virtualenv based on extra
> requirements. And while it gives the users a possibility of having some
> tasks use different dependencies, the drawback is that the venv is created
> dynamically when the task starts - potentially a lot of overhead for
> startup time and some unpleasant failure scenarios - like networking
> problems, PyPI or a local repo not being available, or an automated (and
> unnoticed) upgrade of dependencies.
>
> Those are basically the same problems that caused us to strongly
> discourage our users in our Helm Chart from using
> _PIP_ADDITIONAL_DEPENDENCIES in production and to criticize the Community
> Helm Chart for the dynamic dependency installation they promote as a
> "valid" approach. Yet our PVO currently does exactly this.
>
> We had some past discussions on how this can be improved - with caching,
> or using different images for different dependencies and similar - and we
> even have the AIP-46 proposal to use different images for different sets
> of requirements:
> https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-46+Runtime+isolation+for+airflow+tasks+and+dag+parsing
>
> *Proposal:*
>
> During the discussion yesterday I started to think that a simpler solution
> is possible, one that is easy for us to implement and for users to use.
>
> Why not have different venvs preinstalled and let the PVO choose the one
> that should be used?
>
> It does not invalidate AIP-46. AIP-46 serves a slightly different purpose,
> and some cases cannot be handled this way (when you need different
> "system level" dependencies, for example), but this might be much simpler
> from a deployment point of view and would allow handling
> "multi-dependency sets" for Python libraries only, with minimal deployment
> overhead (overhead that AIP-46 necessarily has). And I think it will be
> enough for a vast number of the "multi-dependency-sets" cases.
>
> Why don't we allow the users to prepare those venvs upfront and simply
> enable the PVO to use them rather than create them dynamically?
>
> *Advantages:*
>
> * it nicely handles cases where some of your tasks need a different set of
> dependencies than others (for execution, not necessarily parsing, at least
> initially).
>
> * no startup time overhead, unlike with the current PVO
>
> * possible to run in both cases - "venv installation" and "docker image"
> installation
>
> * it has a finer granularity level than AIP-46 - unlike in AIP-46 you could
> use different sets of dependencies
>
> * very easy to pull off for the users without modifying their
> deployments. For a local venv, you just create the venvs. For the Docker
> image case, your custom image needs to add several lines similar to:
>
> RUN python -m venv --system-site-packages /opt/venv1 && \
>     /opt/venv1/bin/pip install PACKAGE1==NN PACKAGE2==NN
> RUN python -m venv --system-site-packages /opt/venv2 && \
>     /opt/venv2/bin/pip install PACKAGE1==NN PACKAGE2==NN
>
> and PythonVirtualenvOperator should have an extra
> "use_existing_venv=/opt/venv2" parameter
>
> * we only need to manage ONE image (!) even if you have multiple sets of
> dependencies. This actually has LOWER resource overhead than having
> separate images for each env: the same workers could handle multiple
> dependency sets, for example, and the same image is reused by multiple
> Pods in K8S, etc.
>
> * later (when AIP-43 (separate DAG processor with the ability to use
> different processors for different subdirectories) is completed and AIP-46
> is approved/implemented), we could also extend DAG parsing to be able to
> use those predefined venvs for parsing. That would eliminate the need for
> local imports and add support for using different sets of libraries even
> in top-level code (per DAG, not per task). It would not solve different
> "system" level dependencies - and for that AIP-46 is still a very valid
> case.
>
> *Disadvantages:*
>
> I thought very hard about this one and I actually could not find any
> disadvantages :)
>
> It's simple to implement, use and explain, it can be implemented very
> quickly (like - in a few hours with tests and documentation, I think) and
> performance-wise it is better than any other solution (including AIP-46),
> provided that the case is limited to different Python dependencies.
>
> But possibly there are things that I missed. It all looks too good to be
> true, and I wonder why we do not have it already today - once I thought
> about it, it seems very obvious. So I probably missed something.
>
> WDYT?
>
> J.