potiuk commented on code in PR #33355:
URL: https://github.com/apache/airflow/pull/33355#discussion_r1324802446


##########
airflow/operators/python.py:
##########
@@ -606,7 +614,60 @@ def _prepare_venv(self, venv_path: Path) -> None:
             index_urls=self.index_urls,
         )
 
+    def _calculate_cache_hash(self) -> str:

Review Comment:
   It's not only a race condition. I think it's expected that things will change while running. The backfill case is actually valid. The "latest" you mention is right, but when you run a backfill for multiple dag runs over the span of hours, the `latest` might very easily change mid-flight. I expect people to bump versions of dependencies in their DAGs frequently. So it might well be that someone updates the venv while someone else runs the backfill, and we have to have a way to handle that case. For now I do not see how we could handle the situation if we keep a single venv in this case.
   
   But there is an even more important case. It's often that people will be using venvs for different tasks, even in completely different DAGs. And if we allow caching and let users specify where to cache the venv, then there is a very high potential of different people using the same paths to cache and causing conflicts. We currently have no mechanism (and we do not plan one) to avoid or resolve such conflicts if we do not use hashes.
   
   Imagine:
   
   ```
   DAG1:
       PVO(venv_cache_path=/cache/my_venv,requirements="x")
   
   some other team:
   
   DAG2:
       PVO(venv_cache_path=/cache/my_venv,requirements="y")
   ```
   
   When you get this and keep the venv in the same folder, some of the tasks will start randomly failing (depending on the state of the cache on the worker we are on). If the cache there was created by DAG1, then DAG2 will fail; if the cache was created by DAG2, DAG1's tasks will start failing on that worker. The venv will be there, but it will be a completely different venv. In this case we cannot delete/recreate the existing venv when we see "wrong requirements", because the other DAG's tasks might be using it. So the only option we have, when we detect that we want "y" but see "x", is to fail the task.
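   A minimal sketch of that "fail on mismatch" behaviour (the function and marker-file name are hypothetical, not the PR's actual code) - with a fixed shared path, recreating the venv in place is unsafe, so a requirements mismatch can only raise:

```python
from pathlib import Path


def check_fixed_path_venv(cache_path: Path, requirements: str) -> None:
    """Illustrative check for a single fixed-path venv cache.

    We cannot delete/recreate the venv on mismatch, because another
    DAG's tasks may currently be using it - failing is the only
    safe option.
    """
    marker = cache_path / "requirements.txt"  # hypothetical marker file
    if marker.exists():
        existing = marker.read_text()
        if existing != requirements:
            # Venv exists but was built for different requirements: fail.
            raise RuntimeError(
                f"venv at {cache_path} was built for {existing!r}, "
                f"but this task needs {requirements!r}"
            )
    else:
        # First task on this worker creates the venv and records what
        # it was built for.
        cache_path.mkdir(parents=True, exist_ok=True)
        marker.write_text(requirements)
```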
   
   I don't think your proposal, @uranusjr, has a good solution for that case.
   
   On the other hand, the `hash` approach is WAY better - it solves this very nicely, and it has multiple other advantages that enhance the effect of caching and decrease the venv footprint by allowing seemingly unrelated tasks to share the same venv.
   
   This is precisely why pre-commit hooks have a rather small footprint - this is exactly how a similar "hash" approach works there: multiple unrelated pre-commit hooks with the same requirements share the same venv under the hood.
   
   The way the hash approach will work:
   
   * it will not fail randomly - it will **just** work
   * even better - it will reuse DAG1's task venv in DAG2 if DAG2 also has "x" as a requirement
   * an organisation might even choose to dedicate a volume for all caches this way - to reuse and optimise caching and ask ALL their teams to use the same CACHE_FOLDER. This way there is no need to agree on additional namespacing and conventions between the teams. The "hash" will automatically create or reuse the venv when needed in the most optimized way possible.
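   To illustrate the idea behind the bullets above (a hypothetical sketch, not the PR's `_calculate_cache_hash` implementation), the hash-derived path makes different requirement sets land in different directories of the shared cache folder, while identical requirements automatically share one venv:

```python
import hashlib
from pathlib import Path


def venv_path_for(cache_folder: Path, requirements: str) -> Path:
    """Derive a per-requirements venv directory inside a shared cache folder.

    Illustrative only: here we hash just the requirements string; the real
    implementation would fold in anything that affects the venv contents
    (python version, index URLs, etc.).
    """
    digest = hashlib.sha256(requirements.encode("utf-8")).hexdigest()[:16]
    return cache_folder / f"venv-{digest}"
```

   With this, DAG1's `requirements="x"` and DAG2's `requirements="y"` resolve to different directories under the same CACHE_FOLDER, so neither can clobber the other, and a third task with `requirements="x"` transparently reuses DAG1's venv.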
   
   I really don't see how we can implement it without using a hash (but maybe I am missing something?).



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
