Re: [python] subprocess call of "pip freeze" per pipeline

Joey Tran Thu, 05 Jun 2025 07:19:52 -0700

Would it be worth considering turning it off by default and making it
opt-in? I can imagine it's very useful, but a runner needs to implement
checking the generated requirements file, right? In which case, the default
behavior incurs a cost while not fully implementing the debugging behavior.
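
A rough sketch of what opt-in could look like, assuming a hypothetical
`enabled` flag that a runner sets only when it actually inspects the
generated requirements. None of these names are Beam's actual API; they are
made up for illustration. The sketch also caches the result so the
subprocess runs at most once per process rather than once per pipeline run:

```python
import functools
import subprocess
import sys

# Default command; `python -m pip freeze` lists the installed packages.
PIP_FREEZE_CMD = (sys.executable, "-m", "pip", "freeze")


@functools.lru_cache(maxsize=None)
def frozen_requirements(cmd=PIP_FREEZE_CMD):
    """Run `cmd` in a subprocess and return its stdout lines as a tuple.

    Cached per distinct `cmd`, so repeated pipeline runs in one process
    pay for the subprocess only once. (Hypothetical helper, not Beam code.)
    """
    result = subprocess.run(
        list(cmd), capture_output=True, text=True, check=True)
    return tuple(result.stdout.splitlines())


def maybe_collect_requirements(enabled=False, cmd=PIP_FREEZE_CMD):
    """Return the frozen requirements only when the caller opts in."""
    if not enabled:
        return None  # Default: no subprocess is spawned.
    return frozen_requirements(cmd)
```

With something like this, a runner that never reads the requirements would
skip the cost entirely, and one that does would pass `enabled=True`.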


As another example, the TrivialRunner [1] runs this `pip freeze` subprocess
on every pipeline run, but I don't think it checks the generated file.

(I don't feel strongly either way FWIW)

[1]
https://github.com/apache/beam/blob/b7f2e1611556cf2dab7e9a901d3477023cd71294/sdks/python/apache_beam/runners/trivial_runner.py#L47

On Thu, Jun 5, 2025 at 9:59 AM Danny McCormick via dev <dev@beam.apache.org>
wrote:

> Thanks for calling this out. I generally agree with you. I've found this
> feature to be generally quite useful for production jobs running in
> distributed environments. I have seen several issues which have been solved
> because of it (and similarly I have seen issues which would have benefited
> from it before its introduction). At the same time, I agree it is not worth
> the cost when running locally since you're not at the same risk of
> diverging environments.
>
> I'd vote we disable this for Prism and can take that on as part of
> enabling prism as the default runner [1].
>
> [1] WIP - https://github.com/apache/beam/pull/34612
>
> On Wed, Jun 4, 2025 at 10:48 AM Joey Tran <joey.t...@schrodinger.com>
> wrote:
>
>> Hey all,
>>
>> We recently upgraded to Beam 2.63 from 2.50. After the upgrade, our unit
>> tests testing our runner saw a 4x-5x performance hit. It turned out it was
>> because for every pipeline run, the default `PipelineRunner.run_pipeline`
>> invokes `PipelineRunner.default_environment` [1], which eventually results in
>> a subprocess call to `pip freeze` to gather python requirements for later
>> logging [2].
>>
>> This caught me by surprise and was very hard to debug since the massive
>> slowdown was due to using 2x more subprocesses than I specified for my unit
>> test runner, which resulted in my python processes thrashing. I've turned
>> off this logging feature for our runner, but just wanted to give a heads up
>> as any runner that uses the default `run_pipeline` method will incur this
>> cost. May be relevant to using the PrismRunner to replace the python
>> directrunner (or maybe y'all do want this check?)
>>
>> Cheers,
>> Joey
>>
>> [1]
>> https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/runner.py#L182
>> [2]
>> https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/portability/stager.py#L906
>>
>
