olegkachur-e commented on PR #53390: URL: https://github.com/apache/airflow/pull/53390#issuecomment-3096504780
@potiuk Thank you for the attention and detailed review!

> Could you please elaborate a bit more what exactly problem we are trying to solve here?

The exact problem: there is a side effect of running the `PythonVirtualenvOperator` - some files are left behind that are never cleaned up. My STR in breeze was to set `PYTHONPYCACHEPREFIX="__pycache__"` and run the operator with simple requirements. After the run, a directory was left behind as `__pycache__/tmp/<venv_temp_name>`, so I suggest cleaning it up. I think it should reproduce without `PYTHONPYCACHEPREFIX` as well; it's just a bit harder to trace.

> Also I am not sure what problem it solves to be honest - the `__pycache__` is generally very well rebuilt when needed. I can think of one case where it can be problematic - when you are using same `__pycache__` folders for different Python versions, you will have some version code errors.

With the behavior mentioned in the STR above, I see an issue with many operator runs: they may eventually pollute the file system and exhaust resources in cloud provider environments. As for `__pycache__` being rebuilt when needed - I think there is a bit of a corner case here: the venv is created, executed in a subprocess, and deleted, so there is "nobody left" to rebuild or clean up the cache (to my best understanding).

> But there, I'd say a better solution might be to set PYTHONDONTWRITEBYTECODE before launching interpreter. Yes it **might** be slower a bit because parsing and bytecode generation happens every time, but in vast majority of cases when tasks runs for seconds, this will be negligible overhead - unless you have vast amount of code to parse.

I couldn't agree more! That was also my first intention. Unfortunately, in my experiment runs it doesn't completely fix the problem: there was a reproduction with "no uv" execution, so I decided to attempt a clean-up on each run, matching the exact temp venv directory name that is being used.

> There is a potential problem I see here. On Celery worker when you have multiple tasks in parallel, potentially the same `__pycache__` folder is used by parallel workers running. I am not sure if we can do what you want to do without getting into some race conditions.

So you mean `with TemporaryDirectory(prefix="venv") as tmp_dir:` could potentially result in the same name on different workers? Then in the worst case we would fail to delete again and log a warning message - or are there any other risks?

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
