potiuk commented on issue #41328:
URL: https://github.com/apache/airflow/issues/41328#issuecomment-2284153528

   > If the "latest" version of a package changes, the hash for the respective venv would change too. The only thing I wanted to achieve was to parse dynamic versions into an absolute version, e.g.:
   > colormap>=1.0.0 or colormap are converted into colormap==1.1.0
   
   Sure. The problem with that approach is that it has the potential of ballooning the number of venvs. For example, boto3 releases a new version roughly every day, which means that if you dynamically recreate the venv based on the latest version available in PyPI and have `boto3>x.y.z`, it will create a new copy of the venv every day. Previously this only happened when you actually changed the dependency requirements.
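
   To make the ballooning concrete, here is a minimal, purely illustrative sketch (not Airflow's actual implementation) of a cache key derived from the fully pinned requirement set - the `VENV_CACHE_DIR` name, the hashing scheme and the boto3 version numbers are all assumptions. As soon as any "latest" version moves, the key (and therefore the venv directory) changes:

```python
import hashlib
from pathlib import Path

VENV_CACHE_DIR = Path("/tmp/venv-cache")  # hypothetical cache location

def venv_key(pinned_requirements: list[str]) -> str:
    """Derive a cache key from the fully pinned requirement set."""
    canonical = "\n".join(sorted(pinned_requirements))
    return hashlib.sha256(canonical.encode()).hexdigest()[:16]

# Day 1: resolving "boto3" against PyPI pins boto3==1.34.100 -> one venv dir
print(VENV_CACHE_DIR / venv_key(["boto3==1.34.100", "colormap==1.1.0"]))
# Day 2: boto3 released 1.34.101 overnight -> new key -> a brand new venv dir
print(VENV_CACHE_DIR / venv_key(["boto3==1.34.101", "colormap==1.1.0"]))
```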
   
   But yes, if their hashes are different and they are stored separately, they will be essentially immutable (though there will potentially be many more of them, and a strategy needs to be worked out for disposing of the old ones, because they could grow in a totally uncontrollable way over time - without any DAG author action). I wonder what the proposal for that would be - because even then, some tasks might still be using the old version of the venv (with a different hash) while the new one is being installed and used by a different task.
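
   For the disposal part, one purely hypothetical strategy could be a periodic prune based on when a cached venv directory was last touched. The layout (one directory per hash under a `VENV_CACHE_DIR`) is an assumption, and note that this still does not solve the race with a task that is currently running out of an old venv:

```python
import shutil
import time
from pathlib import Path

VENV_CACHE_DIR = Path("/tmp/venv-cache")  # hypothetical cache location
MAX_AGE_DAYS = 14                         # arbitrary retention period

def prune_stale_venvs() -> None:
    """Remove cached venvs that have not been touched for MAX_AGE_DAYS."""
    if not VENV_CACHE_DIR.exists():
        return
    cutoff = time.time() - MAX_AGE_DAYS * 24 * 3600
    for venv_dir in VENV_CACHE_DIR.iterdir():
        # st_mtime is only a proxy for "last used"; a task could still be
        # running out of this venv when we delete it.
        if venv_dir.is_dir() and venv_dir.stat().st_mtime < cutoff:
            shutil.rmtree(venv_dir, ignore_errors=True)
```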
   
   > The only thing I wanted to achieve was to parse dynamic versions into an absolute version, e.g.:
   > colormap>=1.0.0 or colormap are converted into colormap==1.1.0
   > pip index versions colormap
   
   `pip index` is not nearly enough. You have to run an algorithm to resolve the dependencies - because new versions of requirements might have different constraints - so you actually have to perform a full `pip install` resolution to do such an installation; you cannot just take the "latest" of all dependencies that are specified with a lower bound. For example, if colormap==1.1.0 requires foobar<3.2 and you already had foobar 3.2 installed (because colormap==1.0.0 did not have that limit), pip has to resolve the dependencies and decide whether to downgrade foobar or not upgrade to the newer colormap (otherwise it will end up with a conflict). So any time you want to check for "non-conflicting" dependencies, you basically have to do a full dependency resolution with an eager upgrade strategy (`--upgrade-strategy eager`) or perform a completely new installation (and dependency resolution) without looking at what you have already installed in the target venv.
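
   To illustrate what "full resolution" means in practice, here is a hedged sketch that asks pip to resolve (but not install) the requirements and returns the fully pinned set. It assumes a recent pip (>= 23) where `--dry-run` and `--report -` are available, and the documented shape of pip's JSON installation report:

```python
import json
import subprocess
import sys

def resolve_pinned(requirements: list[str]) -> list[str]:
    """Ask pip to fully resolve (but not install) the given requirements."""
    result = subprocess.run(
        [sys.executable, "-m", "pip", "install", "--dry-run", "--quiet",
         "--ignore-installed", "--report", "-", *requirements],
        check=True, capture_output=True, text=True,
    )
    report = json.loads(result.stdout)
    return sorted(
        f"{item['metadata']['name']}=={item['metadata']['version']}"
        for item in report["install"]
    )

# e.g. resolve_pinned(["colormap>=1.0.0"]) could return something like
# ["colormap==1.1.0", ...] with all transitive dependencies pinned as well
```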
   
   This is the overhead that will need to happen on every single run of a task with such a venv definition - regardless of whether the cache is there - because you need to do that resolution in order to calculate the new hash and compare it with the existing one. This is why it's an overhead: sometimes such resolution involves backtracking and downloading multiple versions of the same package, even if you already have the current version of the dependency in the local cache. It can sometimes take minutes (and this was the main reason we wanted to implement caching - to save the time spent on dependency resolution and downloading).
   
   Dependency resolution against PyPI can be (and often is) quite time- and network-consuming.
   
   Basically you have two options now:
   
   1) No cache - then you always get the latest (at the expense of dependency resolution and downloading packages). Often slow and not predictable.
   
   2) Cache - then you always get whatever was first installed to match the requirements on that machine - which makes it potentially inconsistent between runs on different machines (but with very little overhead: only the first-time resolution and installation).
   
   Essentially, what you want is option 3)
   
   3) Cache, but check whether the cache needs to be invalidated because some new dependencies have been released since the last time -> which is something in-between. Part of the process is faster (if nothing changed, you only pay the price of performing the resolution - which might or might not be slow and is somewhat unpredictable, since it depends on packages released by third parties). It also has the drawback of potentially leaving behind many versions of venvs, which can grow in an uncontrollable way over time - so we need to find a solution for managing those.
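
   Put together, option 3) could look roughly like the sketch below (reusing the hypothetical `resolve_pinned` and `venv_key` helpers from the earlier snippets, plus a made-up `create_venv_and_install`): the resolution cost is paid on every run, but a new venv is only built when the resulting hash is new:

```python
import subprocess
import venv
from pathlib import Path

VENV_CACHE_DIR = Path("/tmp/venv-cache")  # hypothetical cache location

def create_venv_and_install(venv_dir: Path, pinned: list[str]) -> None:
    """Create a fresh venv and install the already-resolved, pinned set."""
    venv.create(venv_dir, with_pip=True)
    subprocess.run([str(venv_dir / "bin" / "pip"), "install", *pinned], check=True)

def venv_for(requirements: list[str]) -> Path:
    pinned = resolve_pinned(requirements)         # always paid: full resolution
    venv_dir = VENV_CACHE_DIR / venv_key(pinned)  # cheap: hash lookup
    if not venv_dir.exists():                     # only rebuild on a new hash
        create_venv_and_install(venv_dir, pinned)
    return venv_dir
```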
   
   But yes, if you want to pursue that and propose a PR - feel free.
   

