tvalentyn opened a new issue, #26792:
URL: https://github.com/apache/beam/issues/26792

   ### What would you like to happen?
   
   https://github.com/apache/beam/pull/16658 changed the Python SDK harness 
container boot sequence to launch SDK processes in separately created 
virtual environments.
   
   It appears that the venv dependency is sometimes not available on non-Beam 
Python container images. Users who supply custom containers may run into errors 
when python3-venv is not installed and then need to install it separately, 
which is inconvenient.
   
   Creating a venv is not strictly required on some runners, so #26753 
changed the behavior to fall back to the global environment when venv is not 
available.
   
   There is a concern that falling back to the global environment may adversely 
affect the users who benefited from the separate venv; see: 
https://github.com/apache/beam/pull/26778#issuecomment-1554897466 .
   
   Possible failure modes:
   - Hypothetical: a one-time flake in creating a venv causes the global 
environment to become polluted with pipeline dependencies, with potential 
side effects on other pipelines that run on the same cluster and use packages 
from the global environment.
   - Less hypothetical: Flink users switch to a custom container that doesn't 
include venv and silently start running into issues like #21123.
   
   Possible avenues to address:
   - Make it explicit whether venv should be used (either opt-in or opt-out).
   - Make venv a requirement for Beam containers.
   - Determine at runtime whether venv is definitely necessary, in which case 
the pipeline should fail (some Flink use cases), or definitely unnecessary 
(Dataflow use cases), in which case execution can continue in the default 
environment.
   - Give a clear error when venv is not available: "venv is required but not 
installed in this execution environment. If you wish to disable venv, set the 
RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT=1 environment variable. If you use a 
custom container, set `ENV RUN_ ...=1`."
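
   To make the explicit opt-out and clear-error options concrete, here is a 
minimal Python sketch of the decision logic. It is an illustration only, not 
the actual boot code: the function names `venv_available` and 
`choose_environment` are hypothetical, and the availability check assumes that 
images without python3-venv are missing the stdlib `ensurepip` module. The 
environment variable name is the one proposed above.

```python
import importlib.util
import os

# Environment variable name proposed in this issue (not an existing setting).
OPT_OUT_VAR = "RUN_PYTHON_SDK_IN_DEFAULT_ENVIRONMENT"


def venv_available() -> bool:
    """Best-effort check that a venv with pip could be created.

    Assumption: when python3-venv is not installed, `ensurepip` (and
    possibly `venv`) cannot be found by the import system.
    """
    return all(
        importlib.util.find_spec(name) is not None
        for name in ("venv", "ensurepip")
    )


def choose_environment(environ=None, has_venv=None) -> str:
    """Return "venv" or "default"; raise if venv is needed but missing."""
    environ = os.environ if environ is None else environ
    has_venv = venv_available() if has_venv is None else has_venv

    # Explicit opt-out: the user asked for the default environment.
    if environ.get(OPT_OUT_VAR) == "1":
        return "default"
    if has_venv:
        return "venv"
    # No silent fallback: fail with an actionable message instead.
    raise RuntimeError(
        "venv is required but not installed in this execution environment. "
        f"If you wish to disable venv, set the {OPT_OUT_VAR}=1 "
        "environment variable."
    )
```

   With this shape, a one-time venv failure surfaces as an error instead of 
silently polluting the global environment, and users who do not need venv 
opt out once in their container image.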
   
   cc: @phoerious (who was working on #21123). 
   
   ### Issue Priority
   
   Priority: 2 (default / most feature requests should be filed as P2)
   
   ### Issue Components
   
   - [X] Component: Python SDK
   - [ ] Component: Java SDK
   - [ ] Component: Go SDK
   - [ ] Component: Typescript SDK
   - [ ] Component: IO connector
   - [ ] Component: Beam examples
   - [ ] Component: Beam playground
   - [ ] Component: Beam katas
   - [ ] Component: Website
   - [ ] Component: Spark Runner
   - [ ] Component: Flink Runner
   - [ ] Component: Samza Runner
   - [ ] Component: Twister2 Runner
   - [ ] Component: Hazelcast Jet Runner
   - [ ] Component: Google Cloud Dataflow Runner

