potiuk commented on code in PR #33355:
URL: https://github.com/apache/airflow/pull/33355#discussion_r1323758553


##########
airflow/operators/python.py:
##########
@@ -606,7 +614,60 @@ def _prepare_venv(self, venv_path: Path) -> None:
             index_urls=self.index_urls,
         )
 
+    def _calculate_cache_hash(self) -> str:

Review Comment:
   > I wonder if this is that worthwhile… Instead of relying on the directory 
name, we could just dump the values into a JSON file inside the virtual 
environment directory, and read it out later for cache validation. That could 
help debugging if caching doesn’t work as expected too.
   
   I agree with @jens-scheffler-bosch "write once, use many" statement.
   
   Validation alone is not nearly enough. What do you do if validation fails? 
Delete the venv? Recreate it? What happens if we are running in a Celery 
worker (multiple tasks) and each of them uses an "almost" identical 
environment? For example, someone submits a change to the DAG that bumps a 
dependency version and the new task is scheduled, but in the same Celery 
worker we are still running an old version of the same task from the previous 
version of the DAG.
   
   Should we delete the venv while another task is running and recreate it? Or 
should we fail the task? Or maybe wait until the other task completes (how?)
   
   I really think venvs should be "write once, use many" and immutable after 
they are created. This is precisely what @jens-scheffler-bosch addressed well: 
even if we have two tasks attempting to create an identical venv, the file 
lock prevents them from doing so - only one will create it and the other will 
use it. But once it is created, you should not modify the existing venv; that 
is a different story, because you don't want to prevent parallel 
"read/modify" - you want to prevent parallel "create". You actually want 
multiple tasks to use the same venv in parallel once it is created - that's 
almost the whole point of it.
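   To make the "create under a lock, then reuse" idea concrete, here is a 
rough sketch - not the PR's actual locking code. The names (`ensure_venv`, 
`create_fn`) are invented for illustration, and it uses POSIX `fcntl`, so it 
is Linux/macOS only:

```python
import fcntl
from pathlib import Path
from typing import Callable

def ensure_venv(venv_path: Path, create_fn: Callable[[Path], None]) -> Path:
    """Create the venv exactly once; later callers reuse it unchanged."""
    venv_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = venv_path.parent / (venv_path.name + ".lock")
    with open(lock_path, "w") as lock:
        # Only one task may hold the lock while creating. Tasks that find
        # the venv already created never modify it, so reads stay parallel.
        fcntl.flock(lock, fcntl.LOCK_EX)
        try:
            if not venv_path.exists():
                create_fn(venv_path)  # the first task creates the venv
        finally:
            fcntl.flock(lock, fcntl.LOCK_UN)
    return venv_path  # every other task just uses the same path
```

   The key property is that the lock guards only creation; once the directory 
exists, concurrent tasks share it read-only without any synchronization.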
   
   IMHO, validation is not cutting it because there is no good action you can 
take once validation fails. We MUST have different venvs if they are 
different.
   
   Interestingly, this is the very same thing that `pre-commit` does. 
Pre-commit also calculates a hash for the venv, based on the pre-commit 
definition. If you have two or more hooks with the same dependencies and 
Python version, they will re-use the existing venv, and venvs are even re-used 
between multiple branches you might work on (see `~/.cache/pre-commit`); they 
also get fully recreated under a different hash when anything changes.
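   The pre-commit-style cache key could be sketched roughly like this. The 
function name and inputs are invented for illustration; the PR's 
`_calculate_cache_hash` may well hash different fields:

```python
import hashlib

def venv_cache_key(requirements: list[str], python_version: str) -> str:
    """Derive a stable directory name from everything that defines the venv.

    Any change to the inputs yields a new key, i.e. a brand-new venv
    directory - existing venvs are never mutated in place.
    """
    payload = "\n".join(sorted(requirements)) + "\n" + python_version
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]
```

   Because the hash is the directory name, "validation" never arises: either 
a venv with that exact key exists and is safe to reuse, or it doesn't and 
gets created fresh.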
   






-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to