alxmrs opened a new issue, #22349: URL: https://github.com/apache/beam/issues/22349
### What would you like to happen?

Acquiring scientific dependencies in the Python ecosystem is challenging. `pip` and `apt-get` alone are not sufficient for various reasons, the most significant of which is community practice: the scientific Python community has standardized on one package manager, Anaconda. Within that package manager, most scientific software is built and distributed via conda-forge.

Given this, I propose the following: the Apache Beam project should build a new set of Docker images that include a `conda`-managed Python environment. The Dockerfile for the containers could look like [this](https://github.com/google/weather-tools/blob/main/weather_mv/Dockerfile):

```
ARG py_version=3.8
FROM apache/beam_python${py_version}_sdk:2.40.0 as beam_sdk
FROM continuumio/miniconda3:4.12.0

ARG py_version

# Update miniconda
RUN conda update conda -y

# Install desired python version
RUN conda install python=${py_version} -y

# Install SDK.
RUN pip install --no-cache-dir apache-beam[gcp]==2.40.0

# Verify that the image does not have conflicting dependencies.
RUN pip check

# Copy files from official SDK image, including script/dependencies.
COPY --from=beam_sdk /opt/apache/beam /opt/apache/beam

# Set the entrypoint to Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
```

From such an image, Python SDK users will gain immense flexibility in adding dependencies to their Beam runtime environment (especially on Dataflow, and likely on all remote Beam runners). For example, adding [a genuinely difficult-to-install dependency](https://github.com/ecmwf/metview-python/issues/19#issuecomment-874990120) would be as easy as adding

```
conda install <package-name> -c conda-forge -y
```

to a `setup.py` file (following the [CUSTOM_COMMANDS pattern](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/#nonpython)).

## Why should Apache Beam do this and not another third party?

This is a unique application of custom containers in Beam.
Instead of an image with specific dependencies for one application, this image would provide a package manager that can obtain nearly all dependencies in the Python ecosystem. I argue that it makes sense for a member of the Apache project, or a similarly open and community-federated project, to manage and host this image in order to guard against potential supply chain attacks.

Further, including `conda` as a Python SDK runtime environment would accelerate dependency management on Apache Beam, especially for the PyData stack: it would help avoid the creation of many similar Docker images (each hosting one specific dependency, or else duplicating the hosting of `conda`).

### Issue Priority

Priority: 3

### Issue Component

Component: sdk-py-core

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
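The `setup.py` step described in the proposal can be sketched roughly as follows. This is a minimal, stdlib-only illustration and not code from the issue or from Beam: the helper names `conda_install_command` and `run_custom_commands` are hypothetical, and `metview` is only an assumed example package name. It shows how a `CUSTOM_COMMANDS` list could carry a `conda install` invocation and how those commands might be executed in order.

```python
import subprocess


def conda_install_command(packages, channel="conda-forge"):
    """Build the argv list for a non-interactive `conda install`.

    Hypothetical helper: mirrors `conda install <package-name> -c conda-forge -y`.
    """
    if not packages:
        raise ValueError("at least one package is required")
    return ["conda", "install", *packages, "-c", channel, "-y"]


# Commands to run when the package is installed on a worker.
# "metview" is an assumed, illustrative package name.
CUSTOM_COMMANDS = [conda_install_command(["metview"])]


def run_custom_commands(commands, run=subprocess.run):
    """Run each command in order; a non-zero exit code raises an error.

    `run` is injectable so the loop can be exercised without conda installed.
    """
    for command in commands:
        print("Running:", " ".join(command))
        run(command, check=True)
```

In a real `setup.py` following the CUSTOM_COMMANDS pattern from the Beam documentation, a loop like `run_custom_commands(CUSTOM_COMMANDS)` would sit inside the `run()` method of a `setuptools.Command` subclass registered through `cmdclass`. It is kept as a plain function here so that the sketch does not depend on `setuptools` or on `conda` being present.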
