olegkachur-e commented on PR #53390:
URL: https://github.com/apache/airflow/pull/53390#issuecomment-3096504780

   @potiuk Thank you for the attention and detailed review! 
   
   > Could you please elaborate a bit more what exactly problem we are trying 
to solve here?
   
   The exact problem: as a side effect of running the `PythonVirtualenvOperator`, some files are left behind that are never cleaned up.
   My STR in breeze was to set `PYTHONPYCACHEPREFIX="__pycache__"` and run the operator with simple requirements.
   After the run, a directory was left behind as `__pycache__/tmp/<venv_temp_name>`, so I suggest cleaning it up.
   I think it reproduces without `PYTHONPYCACHEPREFIX` as well, it's just a bit harder to trace.
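
   To make the STR concrete, here is a minimal, self-contained sketch of the mechanism as I understand it (a standalone `py_compile` call stands in for the operator's subprocess, and `cache_prefix` is a throwaway directory standing in for `__pycache__`; this is not the operator code itself):

```python
import os
import pathlib
import subprocess
import sys
import tempfile

# With PYTHONPYCACHEPREFIX set, CPython writes bytecode under the
# prefix, mirroring the source file's absolute path. Deleting the
# temp dir therefore leaves the mirrored cache subtree behind.
cache_prefix = tempfile.mkdtemp(prefix="pycache")  # stands in for "__pycache__"
with tempfile.TemporaryDirectory(prefix="venv") as tmp_dir:
    script = pathlib.Path(tmp_dir) / "task.py"
    script.write_text("x = 1\n")
    env = {**os.environ, "PYTHONPYCACHEPREFIX": cache_prefix}
    subprocess.run(
        [sys.executable, "-c", f"import py_compile; py_compile.compile({str(script)!r})"],
        env=env,
        check=True,
    )

# tmp_dir itself is gone, but its bytecode mirror survives:
tmp_path = pathlib.Path(tmp_dir)
leftover = pathlib.Path(cache_prefix) / tmp_path.relative_to(tmp_path.anchor)
print(leftover.exists())
```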
   
   
   > 
   > Also I am not sure what problem it solves to be honest - the `__pycache__` 
is generally very well rebuilt when needed. I can think of one case where it 
can be problematic - when you are using same `__pycache__` folders for 
different Python versions, you will have some version code errors.
   > 
   
   With the behavior mentioned in the STR above, I see an issue with many operator runs: the leftovers may eventually pollute the file system and exhaust resources in cloud-provider environments.
   
   As for `__pycache__` being rebuilt when needed - I think there is a bit of a corner case here: the venv is created, executed in a subprocess, and deleted, so there is "nobody left" to rebuild or clean it up (to my best understanding).
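
   The clean-up I have in mind could look roughly like this (a hedged sketch under my assumptions, not the actual diff; `cleanup_pycache_mirror` is an illustrative name):

```python
import os
import shutil
import sys
from pathlib import Path

def cleanup_pycache_mirror(tmp_dir: str) -> None:
    """Hypothetical helper (not this PR's exact code): best-effort removal
    of the bytecode mirror left behind for a just-deleted temp venv dir.

    Only the subtree matching this exact venv path is removed, so
    concurrent venvs with different random names are untouched.
    """
    prefix = sys.pycache_prefix or os.environ.get("PYTHONPYCACHEPREFIX")
    if not prefix:
        # Default layout writes __pycache__ inside tmp_dir itself,
        # which was already removed together with the venv.
        return
    tmp_path = Path(tmp_dir)
    mirror = Path(prefix) / tmp_path.relative_to(tmp_path.anchor)
    # Best effort: a failed delete is not fatal (e.g. already gone).
    shutil.rmtree(mirror, ignore_errors=True)
```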
   
   
   > 
   > But there, I'd say a better solution might be to set `PYTHONDONTWRITEBYTECODE` before launching the interpreter. Yes it **might** be a bit slower because parsing and bytecode generation happen every time, but in the vast majority of cases, when tasks run for seconds, this will be negligible overhead - unless you have a vast amount of code to parse.
   > 
   
   
   I couldn't agree more! That was also my first intention. 
   Unfortunately, in my experiment runs it doesn't completely fix the problem: there was a reproduction with a "no uv" execution, so I decided to make an attempt to clean up on each run, matching the exact temp venv directory name that is being used. 
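
   For reference, the suppression approach discussed above amounts to a one-line env change before launching the venv interpreter (sketch only; `sys.executable` stands in for the venv's python binary here):

```python
import os
import subprocess
import sys

# PYTHONDONTWRITEBYTECODE makes the launched interpreter skip writing
# .pyc files entirely, so no cache subtree is created to begin with.
env = {**os.environ, "PYTHONDONTWRITEBYTECODE": "1"}
proc = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.dont_write_bytecode)"],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
print(proc.stdout.strip())  # → True
```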
   
   
   > There is a potential problem I see here. On Celery worker when you have 
multiple tasks in parallel, potentially the same `__pycache__` folder is used 
by parallel workers running. I am not sure if we can do what you want to do 
without getting into some race conditions.
   >
   
   So you mean `with TemporaryDirectory(prefix="venv") as tmp_dir:` could potentially result in the same name on different workers?
   Then in the worst case we would fail to delete again and emit a log warning message - or are there any other risks?
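
   For what it's worth, on a single host `TemporaryDirectory` delegates to `mkdtemp`, which retries random suffixes until its `mkdir` succeeds, so two simultaneously live venvs cannot share a name on the same filesystem; as far as I can tell, the collision concern only arises for workers that share storage. A quick check of the single-host uniqueness guarantee:

```python
import tempfile

# mkdtemp-backed names: two simultaneously live temp dirs with the
# same prefix never collide on the same filesystem.
with tempfile.TemporaryDirectory(prefix="venv") as a, \
     tempfile.TemporaryDirectory(prefix="venv") as b:
    print(a != b)  # → True
```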
   
   
   

