potiuk edited a comment on pull request #19189: URL: https://github.com/apache/airflow/pull/19189#issuecomment-950347134
Hey @uranusjr @kaxil @ashb (and others). I have an interesting case that I need your opinion on. This has been result of (rather heated) discussion we have in https://github.com/pypa/pip/issues/10556 . Saving you some need to read the (unpleaseant at times) discussion there - short context here (at least my view of it). Seems that the "recommended" way forward for any future installations of pip dependencies is via virtualenv. This is a bit implicit and (as I see it at least) not stated clearly and strongly enough to make it the "only" apprioach, but on the other hand it results in some unremovable warnings in our image that migh confuse our users. Since it seems that no-one had good answers to my questins, I decided to take a stab on it and see what it means for our images (both CI and PROD) to switch from the --user approach to uising venv. I already investigated and fixed a few issues connected (documented also in the PyPI discussion) and I plan to improve the recommendation (and possibly even to help to make it a bit more firm/stronger recommendation - especially for people who build docker images and want to get them optimized by size and follow "aribitrary user id" recommendations from OpenShift. It's mainly about copying venvs between stages in docker image building and applicability of re-using the single venv among mutliple users. I have a few observations and recomendation for other people trying to do it already, but one question remains - so far about our approach of creating venv by PythonVirtualenvOperator. The main problem is (and it is partially an Airflow bug, partially the way how virtualenv work) - that even currently PythonVirtualenvOperator will behave DIFFERENTLY when airflow is instaled as "system" package and when it is installed in virtualenv. This results in `tests/operators/test_python.py::TestPythonVirtualenvOperator::test_airflow_context` failing (you will see it failing in this PR), The problem is that if Airflow (and all its dependencies) are installed in virtualenv, then PythonVirtualenvOperator will behave wrongly. Specifically in this case it will fail if "dill" and "system_site_packages" are enabled (even without `dill` it will behave differently but with `dill` is pretty straightforward what happens). 1) First problem (smaller) - instead of "airflow" version of pip - it will use the "system" version of pip. In our case we have pip 23.1 installed in airflow venv, on the other hand pip 21.2.4 is installed in "debian-buster" by default. If airflow (and pip) is installed in venv, the new venv created will have (not surprisingly) pip 21.2.4 even if --system-site-packages is used. This will be different if airflow is installed as "system" package or with `--user` flag. It would be nice to get the same version as the one used in airflow venv, however - one might say "this is as expected" . Not sure if we want to do anything about it - I have asked pip maintainers if they have some solutions to that. We could do several things here: - we can accept that as "expected" behaviour, we could reinstall pip, setuptools, wheels to the same versions that are installed in airflow venv as well. I am not too woried about it - though I would love to hear your opinion. 2) Second problem (much bigger and actually at least a bit undocumented Airflow behaviour) - as the test shows, if airflow is installed in venv only, the new virtualenv will neither have no airflow no dill, no lazy-proxy installed (in the current PythonVirtualenv) - you would have to specify all those as requirements explicitly. This **might** be considered as expected behaviour, however - again - it will work fine when airlfow is installed as "System" package or with --user flag - becuase the --system-site-packages flag will take care of carrying those packages to the new environment (with --user flag it will even actually work WITHOUT --system-site-packages). It looks like this behaviour was made "Somewhat" consistent in case no ---system-site-packages were used - the python virtualenv has "if" when the `lazy-proxy` and `dill` are added to requirements in this case. But apparently the case when "Airflow" is installed in virtualenv is not handled similarly. So we have pretty inconsistent behaviour here and I would love to make it consistent (or at least precisely document the inconsistencies in those different cases). I tried to fix it in by adding `lazy-proxy` and `dill` even if `--system-site-packages` are not present (last fixup) but this is not a good fix - dill imports `pendulum` which is also needed in this case and also it does not handle the case of airflow macros and custom macros that will also be missing in this case because airflow is missing as well. I'd love to hear your opinion on that. I see three possible approaches: 1) we clearly document the current behaviour and inconsistencies - which is to expect `airflow` / `dill` and other needed packages - to be added explicitly when airflow is installed in virtualenv. This seems like most "bare" solution that will need the users to understand differences between airflow installed as system package and venv. 2) we handle "system_site_packages" parameter differently when airlfow is installed in venv. We could in this case not only use `--system-site-packages` but also make sure that all the packages that are in virtualenv of airflow are also present in the new virtualenv. This can be done rather easily with .pth approach or usinng `virtualenv-clone` package, This one seems to be best for consistency - no matter if airflow is installed as system package, --user, virtualenv, this parameter will make all tha packages of airflow, dill etc. available in the new venv. 3) we add third parameter "clone_virtualenv" or similar and only then clone the venv (if airflow is used as virtualenv) - we might need to discuss how it should interfere with "system_site_packages" parameter but this seems most "versatile" solution. Would love to hear what others think about it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
