potiuk commented on pull request #19189:
URL: https://github.com/apache/airflow/pull/19189#issuecomment-950347134


   Hey @uranusjr @kaxil @ashb (and others). I have an interesting case that I 
need your opinion on.
   This has been result of (rather heated) discussion we have in 
https://github.com/pypa/pip/issues/10556 .
   
   Saving you some  need to read the (unpleaseant at times) discussion there - 
short context here (at least my view of it).
   
   Seems that the "recommended" way forward for any future installations of pip 
dependencies is via virtualenv. This is a bit implicit and (as I see it at 
least) not stated clearly and strongly enough to make it the "only" apprioach, 
but on the other hand it results in some unremovable warnings in our image that 
migh confuse our users.
   
   Since ti seems that noone had good answers to my questiins, I decided to 
take a stab on it and see what it means for our images (both CI and PROD) to 
switch from the --user approach to uising venv.
   
   I already investigated and fixed a few issues connected (documented also in 
the PyPI discussion) and I plan to improve the recommendation (and possibly 
even to help to make it a bit more firm/stronger recommendation - especially 
for people who build docker images and want to get them optimized by size and 
follow "aribitrary user id" recommendations from OpenShift. It's mainly about 
copying venvs between stages in docker image building and applicability of 
re-using the single venv among mutliple users. I have a few observations and 
recomendation for other people trying to do it already, but one question 
remains - so far about our approach of creating venv by 
PythonVirtualenvOperator.
   
   The main problem is (and it is partially an Airflow bug, partially the way 
how virtualenv work) - that even currently PythonVirtualenvOperator will behave 
DIFFERENTLY when airflow is instaled as "system" package and when it is 
installed in virtualenv. This results in 
`tests/operators/test_python.py::TestPythonVirtualenvOperator::test_airflow_context`
 failing (you will see it failing in this PR),
   
   The problem is that if Airflow (and all its dependencies) are installed in 
virtualenv, then PythonVirtualenvOperator will behave wrongly. Specifically in 
this case it will fail if "dill" and "system_site_packages" are enabled (even 
without `dill` it will behave differently but with `dill` is pretty 
straightforward what happens).
   
   1) First problem (smaller) - instead of "airflow" version of pip - it will 
use the "system" version of pip. In our case we have pip 23.1 installed in 
airflow venv, on the other hand pip 21.2.4 is installed in "debian-buster" by 
default. If airflow (and pip) is installed in venv, the new venv created will 
have (not surprisingly) pip 21.2.4 even if --system-site-packages is used. This 
will be different if airflow is installed as "system" package or with `--user` 
flag. It would be nice to get the same version as the one used in airflow venv, 
however - one might say "this is as expected"  . Not sure if we want to do 
anything about it - I have asked pip maintainers if they have some solutions to 
that. We could do several things here: - we can accept that as "expected" 
behaviour, we could reinstall pip, setuptools, wheels to the same versions that 
are installed in airflow venv as well. I am not too woried about it - though I 
would love to hear your opinion.
   
   2) Second problem (much bigger and actually at least a bit undocumented 
Airflow behaviour) - as the test shows, if airflow is installed in venv only, 
the new virtualenv will neither have no airflow no dill, no lazy-proxy 
installed (in the current PythonVirtualenv) - you would have to specify all 
those as requirements explicitly. This **might** be considered as expected 
behaviour, however - again - it will work fine when airlfow is installed as 
"System" package or with --user flag - becuase the --system-site-packages flag 
will take care of carrying those packages to the new environment (with --user 
flag it will even actually work WITHOUT --system-site-packages). It looks like 
this behaviour was made "Somewhat" consistent in case no 
---system-site-packages were used - the python virtualenv has "if" when the 
`lazy-proxy` and `dill` are added to requirements in this case. But apparently 
the case when "Airflow" is  installed in virtualenv is not handled similarly.
   So we have pretty inconsistent behaviour here and I would love to make it 
consistent (or at least precisely document the inconsistencies in those 
different cases).
   
   I tried to fix it in by adding "lazy-proxy" and "dill` even if 
`--system-site-packages` are not present (last fixup) but this is not a good 
fix - dill imports `pendulum` which is also needed in this case and also it 
does not handle the case of airflow macros and custom macros that will also be 
missing in this case because airflow is missing as well. 
   
   I'd love to hear your opinion on that. I see three possible approaches:
   
   1) we clearly document the current behaviour and inconsistencies - is to 
expect `airflow` /`dill` and other needed packages - to be added explicitly 
when airflow is installed in virtualenv. This seems like most "bare" solution 
that will need the users to understand differences between airflow installed as 
system package and venv.
   
   2) we handle "system_site_packages" parameter differently when airlfow is 
installed in venv. We could in this case not only use `--system-site-packages" 
but also make sure that all the packages that are in virtualenv of airflow are 
also present in the new virtualenv. This can be done rather easily with .pth 
approach or usinng `virtualenv-clone` package, This one seems to be best for 
consistency - no matter if airflow is installed as system package, --user, 
virtualenv, this parameter will make all tha packages of airflow, dill etc. 
available in the new venv.
   
   3) we add third parameter "clone_virtualenv" or similar and only then clone 
the venv (if airflow is used as virtualenv) - we might need to discuss how it 
should interfere with "system_site_packages" parameter but this seems most 
"versatile" solution.
   
   Would love to hear what others think about it.
   
    
   
   
   
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to