Thanks for calling this out; I generally agree. I've found this feature
quite useful for production jobs running in distributed environments: I have
seen several issues that were solved because of it, and others that would
have benefited from it before its introduction. At the same time, I agree it
is not worth the cost when running locally, since you're not at the same
risk of diverging environments.
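
For anyone who wants to gauge that cost on their own machine, here is a
minimal sketch. It is not Beam code: `time_command` is a hypothetical helper,
and it assumes the overhead is dominated by the `pip freeze` subprocess
itself, invoked here as `python -m pip freeze`.

```python
import subprocess
import sys
import time
from typing import List, Sequence


def time_command(cmd: Sequence[str], repeats: int = 3) -> List[float]:
    """Return the wall-clock seconds taken by each run of `cmd`.

    Hypothetical helper for illustration; not part of Apache Beam.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        # Output is discarded; only the elapsed time matters here.
        subprocess.run(
            list(cmd),
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            check=False,
        )
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # Assumption: the environment capture boils down to a call like this one.
    pip_freeze = [sys.executable, "-m", "pip", "freeze"]
    for i, seconds in enumerate(time_command(pip_freeze), start=1):
        print(f"pip freeze run {i}: {seconds:.2f}s")
```

Multiplying the measured time by the number of pipelines a test suite runs
gives a rough lower bound on the added overhead; it does not capture the
extra process contention Joey describes.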

I'd vote we disable this for Prism and can take that on as part of enabling
Prism as the default runner [1].

[1] WIP - https://github.com/apache/beam/pull/34612

On Wed, Jun 4, 2025 at 10:48 AM Joey Tran <joey.t...@schrodinger.com> wrote:

> Hey all,
>
> We recently upgraded to Beam 2.63 from 2.50. After the upgrade, our unit
> tests testing our runner saw a 4x-5x performance hit. It turned out it was
> because for every pipeline run, the default `PipelineRunner.run_pipeline`
> invokes `PipelineRunner.default_environment` [1], which eventually results in
> a subprocess call to `pip freeze` to gather Python requirements for later
> logging [2].
>
> This caught me by surprise and was very hard to debug since the massive
> slowdown was due to using 2x more subprocesses than I specified for my unit
> test runner, which resulted in my python processes thrashing. I've turned
> off this logging feature for our runner, but just wanted to give a heads up
> as any runner that uses the default `run_pipeline` method will incur this
> cost. May be relevant to using the PrismRunner to replace the Python
> DirectRunner (or maybe y'all do want this check?)
>
> Cheers,
> Joey
>
> [1]
> https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/runner.py#L182
> [2]
> https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/portability/stager.py#L906
>
