Hey all,

We recently upgraded from Beam 2.50 to 2.63. After the upgrade, the unit
tests for our runner became 4x-5x slower. It turned out that on every
pipeline run, the default `PipelineRunner.run_pipeline` invokes
`PipelineRunner.default_environment` [1], which eventually results in a
subprocess call to `pip freeze` to gather Python requirements for later
logging [2].

This caught me by surprise and was very hard to debug: the massive
slowdown came from running 2x the subprocesses I had specified for my unit
test runner, which left my Python processes thrashing. I've turned off this
logging feature for our runner, but wanted to give a heads up, since any
runner that uses the default `run_pipeline` method will incur this cost. It
may be relevant to using the PrismRunner to replace the Python DirectRunner
(or maybe y'all do want this check?).
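
For anyone who wants to check whether their own runner tests are affected,
here is a minimal sketch of one way to surface hidden subprocess calls. It
uses only the standard library and does not touch any Beam API. The names
`record_subprocess_commands` and `run_pipeline_stand_in` are hypothetical;
the stand-in just spawns one child process, and in a real test you would
replace it with your own pipeline run and look for a `pip freeze` entry.

```python
import contextlib
import subprocess
import sys
from unittest import mock


@contextlib.contextmanager
def record_subprocess_commands():
    """Record the command of every subprocess.Popen created in the block."""
    commands = []
    real_popen = subprocess.Popen

    class RecordingPopen(real_popen):
        # Subclass so the child process still runs; we only note its command.
        def __init__(self, args, *rest, **kwargs):
            commands.append(args)
            super().__init__(args, *rest, **kwargs)

    # subprocess.run / check_output look up Popen at call time, so patching
    # the module attribute also captures calls made through those helpers.
    with mock.patch.object(subprocess, "Popen", RecordingPopen):
        yield commands


def run_pipeline_stand_in():
    # Hypothetical stand-in for a pipeline run (assumption, not Beam code):
    # it launches a single child process, like the hidden call described above.
    subprocess.check_output([sys.executable, "-c", "print('hello')"])


if __name__ == "__main__":
    with record_subprocess_commands() as commands:
        run_pipeline_stand_in()
    for command in commands:
        print("subprocess launched:", command)
```

This only sees processes started through Python's `subprocess` module in
the current process, so anything spawned another way will not show up.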

Cheers,
Joey

[1]
https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/runner.py#L182
[2]
https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/portability/stager.py#L906
