Thanks for calling this out. I generally agree. I've found this feature quite useful for production jobs running in distributed environments: several issues have been solved because of it, and before its introduction I saw issues that would have benefited from it. At the same time, I agree it isn't worth the cost when running locally, since the risk of diverging environments is much lower there.
I'd vote we disable this for Prism, and we can take that on as part of enabling Prism as the default runner [1].

[1] WIP - https://github.com/apache/beam/pull/34612

On Wed, Jun 4, 2025 at 10:48 AM Joey Tran <joey.t...@schrodinger.com> wrote:

> Hey all,
>
> We recently upgraded to Beam 2.63 from 2.50. After the upgrade, the unit
> tests for our runner saw a 4x-5x performance hit. It turned out that for
> every pipeline run, the default `PipelineRunner.run_pipeline` invokes
> `PipelineRunner.default_environment` [1], which eventually results in a
> subprocess call to `pip freeze` to gather Python requirements for later
> logging [2].
>
> This caught me by surprise and was very hard to debug, since the massive
> slowdown came from using 2x more subprocesses than I had specified for my
> unit test runner, which caused my Python processes to thrash. I've turned
> off this logging feature for our runner, but I just wanted to give a heads
> up, since any runner that uses the default `run_pipeline` method will incur
> this cost. It may also be relevant to replacing the Python DirectRunner
> with the PrismRunner (or maybe y'all do want this check?).
>
> Cheers,
> Joey
>
> [1]
> https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/runner.py#L182
> [2]
> https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/portability/stager.py#L906
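
For anyone else hitting this in a local unit-test suite before any Prism-side change lands, here is a minimal sketch of one way to sidestep the per-pipeline `pip freeze` subprocess. It assumes the capture happens in a `Stager._create_stage_submission_env_dependencies` helper (a name inferred from the stager.py link above; it may differ across Beam versions) and simply stubs that helper out with pytest and unittest.mock:

```python
# conftest.py (illustrative sketch, not official Beam guidance)
#
# Stub out the (assumed) stager helper that shells out to `pip freeze`
# on every Pipeline.run(), so local unit tests do not pay the
# subprocess cost. Verify the helper name against your Beam version
# before relying on this.

from unittest import mock

import pytest


@pytest.fixture(autouse=True)
def skip_submission_env_capture():
    # Returning an empty artifact list means no dependency file is
    # created or logged for the submitted pipeline.
    with mock.patch(
        'apache_beam.runners.portability.stager.Stager.'
        '_create_stage_submission_env_dependencies',
        return_value=[]):
        yield
```

The alternative mentioned in the thread, overriding `default_environment` (or `run_pipeline`) in a custom runner, achieves the same effect at the runner level rather than in test configuration.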