tvalentyn opened a new issue, #26792: URL: https://github.com/apache/beam/issues/26792
### What would you like to happen?

https://github.com/apache/beam/pull/16658 changed the Python SDK harness container boot sequence to launch SDK processes in separately created virtual environments. It appears that the venv dependency is sometimes not available on non-Beam Python container images. Users who supply custom containers may run into errors when python3-venv is not installed and need to install it separately, which is inconvenient.

Creating a venv is not strictly required on some runners, so #26753 changed the behavior to use the global environment when venv is not available.

There is a concern that falling back to the global environment may have adverse effects on the group of users who benefited from the separate venv, see: https://github.com/apache/beam/pull/26778#issuecomment-1554897466 .

Possible failure modes:

- There is a hypothetical one-time flake in creating a venv, and the global environment becomes polluted with pipeline dependencies, potentially causing side effects on pipelines that run on the same cluster and use packages from the global environment.
- A less hypothetical scenario: Flink users decide to use a custom container that doesn't include venv, and now they silently start running into issues like #21123.

Possible avenues to address:

- Make it explicit whether venv should be used (either opt-in or opt-out).
- Make venv a requirement for Beam containers.
- Determine at runtime whether venv is definitely necessary and the pipeline should fail (some Flink use cases), or definitely unnecessary (Dataflow use cases) and execution can continue in the default environment.
- Give a clear error when venv is not available: "venv is required but not installed in this execution environment. If you wish to disable venv, set RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1 environment variable. If you use a custom container, set `ENV RUN_ ...=1`"

cc: @phoerious (who was working on #21123).
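To make the "explicit opt-out plus clear error" avenue concrete, here is a minimal Python sketch of the decision logic. It is only an illustration and not the actual Beam boot code: the function names, the `VenvUnavailableError` class, and the check for `ensurepip` are assumptions made for this example. Only the `RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT` variable name and the wording of the error come from the proposal above.

```python
import importlib.util
import os

# Variable name taken from the error message proposed in this issue.
OPT_OUT_VAR = "RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT"


class VenvUnavailableError(RuntimeError):
    """Hypothetical error type for a missing venv with no opt-out set."""


def venv_available():
    """Best-effort check that a virtual environment can be created.

    Assumption: on some distributions the `venv` module is importable
    even without python3-venv, while `ensurepip` is missing, so both
    modules are checked here.
    """
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("venv", "ensurepip")
    )


def choose_environment(environ=None, has_venv=None):
    """Return "default" or "venv", or raise instead of falling back silently."""
    environ = os.environ if environ is None else environ
    # The explicit opt-out wins, even when venv is installed.
    if environ.get(OPT_OUT_VAR) == "1":
        return "default"
    if has_venv is None:
        has_venv = venv_available()
    if has_venv:
        return "venv"
    raise VenvUnavailableError(
        "venv is required but not installed in this execution environment. "
        f"If you wish to disable venv, set the {OPT_OUT_VAR}=1 "
        "environment variable."
    )
```

With this shape, a custom container without python3-venv fails fast with an actionable message unless the user has opted out, which avoids the silent fallback described in the failure modes.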
### Issue Priority

Priority: 2 (default / most feature requests should be filed as P2)

### Issue Components

- [X] Component: Python SDK
- [ ] Component: Java SDK
- [ ] Component: Go SDK
- [ ] Component: Typescript SDK
- [ ] Component: IO connector
- [ ] Component: Beam examples
- [ ] Component: Beam playground
- [ ] Component: Beam katas
- [ ] Component: Website
- [ ] Component: Spark Runner
- [ ] Component: Flink Runner
- [ ] Component: Samza Runner
- [ ] Component: Twister2 Runner
- [ ] Component: Hazelcast Jet Runner
- [ ] Component: Google Cloud Dataflow Runner
