potiuk commented on issue #41328: URL: https://github.com/apache/airflow/issues/41328#issuecomment-2284153528
> If the "latest" version of a package changes the hash for the respective venv would change too. The only thing I wanted to achieve was to parse dynamic versions into absolute version like f.e.:
> colormap>=1.0.0 or colormap are converted into colormap==1.1.0

Sure. The problem with that approach is that it can balloon the number of venvs. For example, boto3 releases a new version every day or so, which means that if you dynamically recreate the venv based on the latest version available on PyPI and have boto3>x.y.z, it will create a new copy of the venv every day. Previously this happened only when you actually changed the dependency requirements.

But yes, if their hashes are different and they are stored separately, they will be essentially immutable. There will potentially be many more of them, though, and a strategy needs to be worked out for disposing of the old ones, because they can grow in a totally uncontrollable way without any DAG author action. I wonder what the proposal for that would be, because even then some tasks could still be using the old version of the venv (with a different hash) while the new one is being installed and used for a different task.

> The only thing I wanted to achieve was to parse dynamic versions into absolute version like f.e.:
> colormap>=1.0.0 or colormap are converted into colormap==1.1.0
> pip index versions colormap

`pip index` is not nearly enough. You have to run the dependency resolution algorithm, because new versions of the requirements might have different limitations. So you actually have to perform a full pip install resolution; you cannot just take the "latest" of every dependency that is specified with a lower bound.
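A toy illustration of why picking the newest release of each package independently is not enough. Everything here is made up for the sketch: the in-memory "index", the `foobar` package and its `<3.2` upper bound are hypothetical, and the real work is done by pip's resolver, not by code like this.

```python
# Toy model only: the "index" and its constraints are invented for illustration.
# Maps package -> version -> {dependency: exclusive upper bound}.
TOY_INDEX = {
    "colormap": {"1.0.0": {}, "1.1.0": {"foobar": "3.2"}},  # 1.1.0 needs foobar<3.2
    "foobar": {"3.1": {}, "3.2": {}},
}


def as_tuple(version):
    """Turn '1.10.0' into (1, 10, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def naive_latest(index):
    """Pick the newest version of every package, ignoring all constraints."""
    return {name: max(versions, key=as_tuple) for name, versions in index.items()}


def find_conflicts(index, pins):
    """Return (package, dependency) pairs whose upper bound is violated by pins."""
    conflicts = []
    for name, version in pins.items():
        for dep, upper in index[name][version].items():
            if as_tuple(pins[dep]) >= as_tuple(upper):
                conflicts.append((name, dep))
    return conflicts


pins = naive_latest(TOY_INDEX)
print(pins)
print(find_conflicts(TOY_INDEX, pins))
```

Taking the newest version of both packages yields a set of pins that violates the upper bound, which is exactly the situation a real resolver has to detect and work around.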
For example, if colormap==1.1.0 requires foobar<3.2 and you already have foobar 3.2 installed (because colormap==1.0.0 did not have that limit), pip has to resolve the dependencies and decide whether to downgrade foobar or not upgrade to the newer colormap (otherwise it ends up with a conflict). So any time you want to check for "non-conflicting" dependencies, you basically have to do a full dependency resolution with the `eager` upgrade strategy (`pip install --upgrade --upgrade-strategy eager`), or perform a completely new installation (and dependency resolution) without looking at what is already installed in the target venv.

This overhead will be paid on every single run of a task with such a venv definition, regardless of whether the cache is there, because you need to do that resolution in order to calculate the new hash and compare it with the existing one. The resolution can involve back-tracking and downloading multiple versions of the same package, even if you already have the current version of the dependency in the local cache. It can sometimes take minutes (and this was the main reason why we wanted to implement caching: to save time on dependency resolution and downloading). Dependency resolution against PyPI can be (and often is) quite time- and network-consuming.

Basically you have two options now:

1) No cache - then you always get the latest (at the expense of dependency resolution and downloading packages). Often slow and not predictable.
2) Cache - then you always get the "first matching requirements installed" on that machine, which makes it potentially inconsistent between runs on different machines (but with very little overhead: resolution and installation happen only the first time).

Essentially, what you want is option 3:

3) Cache, but check whether the cache needs to be invalidated because some new dependencies have been released since the last time -> which is something in-between.
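A rough sketch of what option 3 could look like, to make the cost visible. This is not how Airflow computes its venv hash today; `resolve_pins`, `cache_key` and `needs_new_venv` are hypothetical helpers. It assumes a pip recent enough to support `--dry-run` and `--report` (roughly pip 23.0 or later), and the resolution step still talks to the index on every call.

```python
import hashlib
import json
import subprocess
import sys


def pins_from_report(report):
    """Extract sorted 'name==version' pins from a pip installation report."""
    return sorted(
        f"{item['metadata']['name'].lower()}=={item['metadata']['version']}"
        for item in report["install"]
    )


def resolve_pins(requirements):
    """Run a full resolution from scratch without installing anything.

    This is the expensive part: it needs network access and may back-track.
    """
    cmd = [
        sys.executable, "-m", "pip", "install",
        "--dry-run", "--ignore-installed", "--quiet", "--report", "-",
        *requirements,
    ]
    result = subprocess.run(cmd, check=True, capture_output=True, text=True)
    return pins_from_report(json.loads(result.stdout))


def cache_key(pins, python_version):
    """Hypothetical cache key: hash of the Python version plus the resolved pins."""
    payload = "\n".join([python_version, *sorted(pins)])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def needs_new_venv(pins, python_version, existing_keys):
    """True when no cached venv matches the freshly resolved pins."""
    return cache_key(pins, python_version) not in existing_keys
```

In use, something like `needs_new_venv(resolve_pins(["colormap>=1.0.0"]), "3.12", existing_keys)` would run before every task, so the resolution cost is paid even when the answer is "reuse the cached venv", and every new upstream release produces a new key.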
With option 3, part of the process is faster: if nothing changed, you only pay the price of performing the resolution, which might or might not be slow and is somewhat unpredictable (it depends on packages released by 3rd parties). It also has the drawback of potentially leaving behind many versions of venvs, which can grow in an uncontrollable way over time. So we need to find a solution for managing those.

But yes, if you want to pursue that and propose a PR - feel free.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
