alxmrs opened a new issue, #22349:
URL: https://github.com/apache/beam/issues/22349

   ### What would you like to happen?
   
   Acquiring scientific dependencies in the Python ecosystem is challenging. 
`pip` and `apt-get` alone are not sufficient, for various reasons, the most 
significant of which is community convention. The scientific Python community 
has standardized on one package manager: `conda` (distributed with Anaconda). 
Within that ecosystem, most scientific software is built and distributed via 
conda-forge. 
   
   Given this, I propose the following: the Apache Beam project should build a 
new set of Docker images that include a `conda`-managed Python environment. The 
Dockerfile for the containers could look like 
[this](https://github.com/google/weather-tools/blob/main/weather_mv/Dockerfile):
 
   ```
   ARG py_version=3.8
   FROM apache/beam_python${py_version}_sdk:2.40.0 as beam_sdk
   FROM continuumio/miniconda3:4.12.0
   ARG py_version
   
   # Update miniconda
   RUN conda update conda -y
   
   # Install desired python version
   RUN conda install python=${py_version} -y
   
   # Install SDK.
   RUN pip install --no-cache-dir apache-beam[gcp]==2.40.0
   
   # Verify that the image does not have conflicting dependencies.
   RUN pip check
   
   # Copy files from official SDK image, including script/dependencies.
   COPY --from=beam_sdk /opt/apache/beam /opt/apache/beam
   
   # Set the entrypoint to Apache Beam SDK launcher.
   ENTRYPOINT ["/opt/apache/beam/boot"]
   ```
   
   With such an image, Python SDK users would gain immense flexibility in 
adding dependencies to their Beam runtime environment (especially on Dataflow, 
and likely on all remote Beam runners). For example, installing [a genuinely 
difficult-to-install 
dependency](https://github.com/ecmwf/metview-python/issues/19#issuecomment-874990120)
 would be as easy as adding 
   
   ```
   conda install <package-name> -c conda-forge -y
   ```
   
   to a `setup.py` file (following the [CUSTOM_COMMANDS 
pattern](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython)).
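
As a rough sketch of how that could look, the snippet below builds the `conda install` command from above and runs it from a setuptools build hook. It assumes the worker image already ships `conda`, as in the proposed Dockerfile. The package name `example-package` and the helper names are hypothetical, and hooking `build_py` is one possible wiring, not necessarily the one used in the Beam documentation.

```python
import subprocess

# Hypothetical package list; replace with the conda-forge packages you need.
CONDA_PACKAGES = ["example-package"]


def conda_install_command(package, channel="conda-forge"):
    """Return the argv list for `conda install <package> -c <channel> -y`."""
    return ["conda", "install", package, "-c", channel, "-y"]


# Same role as CUSTOM_COMMANDS in the Beam docs: a list of argv lists.
CUSTOM_COMMANDS = [conda_install_command(p) for p in CONDA_PACKAGES]


def run_custom_commands(commands, runner=subprocess.check_call):
    """Run each command in order; check_call raises if a command fails."""
    for command in commands:
        print("Running:", " ".join(command))
        runner(command)


def make_cmdclass():
    """Build a cmdclass mapping that runs the commands before build_py.

    setuptools is imported lazily so the helpers above work without it.
    """
    from setuptools.command.build_py import build_py

    class BuildPyWithConda(build_py):
        def run(self):
            run_custom_commands(CUSTOM_COMMANDS)
            super().run()

    return {"build_py": BuildPyWithConda}


# In a real setup.py, the file would end with something like:
#   setuptools.setup(name=..., version=..., cmdclass=make_cmdclass())
```

The `runner` parameter exists only so the commands can be checked without invoking `conda`; on a worker the default `subprocess.check_call` would run them for real.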
   
   
   ## Why should Apache Beam do this and not another third party?
   
   This is a unique application of custom containers in Beam. Instead of an 
image with specific dependencies for one application, this image provides a 
package manager that can obtain nearly all dependencies in the Python 
ecosystem. I argue that it makes sense for a member of the Apache project, or 
a similarly open and community-federated project, to manage and host this 
image in order to guard against potential supply chain attacks. Further, 
including `conda` as a Python SDK runtime environment would accelerate 
dependency management, especially of the PyData stack, on Apache Beam: it 
would help avoid the creation of many similar Docker images (each hosting a 
specific dependency, or else duplicating a `conda` installation).
   
   ### Issue Priority
   
   Priority: 3
   
   ### Issue Component
   
   Component: sdk-py-core

