alxmrs opened a new issue, #22675: URL: https://github.com/apache/beam/issues/22675
### What would you like to happen?

I'd like to offer an extension to my feature request in #22349 (for `conda` support in the Python Beam SDK Docker image).

Given that scientific software is most often distributed via the Anaconda package manager (typically from `conda-forge`), I propose that users of the Python Beam SDK be allowed to add Python dependencies via an `environment.yml` file. I'm imagining something like:

```
python era5_climatology.py --conda_environment environment.yml --runner DataflowRunner # etc...
```

After the user specifies the environment, the remote Beam runner should set up a Docker image with the Anaconda package manager and install all of the dependencies listed in the `environment.yml` file into the global runtime environment. These packages should be usable from each step in the pipeline.

I anticipate that such a feature would be really valuable to members of the scientific Python community, who are more versed in Anaconda environments than in Docker. Indeed, this could drastically simplify setting up dependencies for Python users, saving them from compiling scientific packages in Docker or debugging `pip` and `setuptools`. There is a tradeoff, however: a prebuilt Docker image will offer faster start times than installing dependencies at runtime.

I see this feature fitting alongside the `pip`, `tar` and `setup.py` approaches that already exist for [managing python dependencies](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/).

This feature would depend on the existence of a standard Anaconda SDK image (see the connected issue above).
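As a rough illustration of the install step described above, here is a minimal Python sketch of how a worker might build the command that installs an `environment.yml` into the image's global (`base`) environment. Nothing here is an existing Beam API: the helper `conda_install_command` and the idea of calling it from a worker boot sequence are assumptions made for illustration. The sketch also assumes that `mamba` accepts the same `env update` subcommand as `conda`.

```python
import shlex
from typing import List


def conda_install_command(env_file: str, tool: str = "conda") -> List[str]:
    """Build the command that installs an environment file into `base`.

    Hypothetical helper: `tool` may be "conda" or "mamba" (assumed to be
    drop-in compatible for this subcommand).
    """
    if tool not in ("conda", "mamba"):
        raise ValueError(f"unsupported tool: {tool!r}")
    # `env update --name base` installs the listed packages into the
    # global environment rather than creating a new named environment.
    return [tool, "env", "update", "--name", "base", "--file", env_file]


if __name__ == "__main__":
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(conda_install_command("environment.yml")))
```

A real worker would pass the returned list to something like `subprocess.run(..., check=True)` during startup, which is where the start-time cost mentioned above would be paid.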
## Implementation Notes

The entry point for this feature would be an additional argument to `SetupOptions`: https://github.com/apache/beam/blob/c9c57a765dbaae7960deef80c0471766b26636d6/sdks/python/apache_beam/options/pipeline_options.py#L1105

For the `DataflowRunner` specifically, we could check whether an `environment.yml` file was passed in and, if so, choose the `conda`-enabled Python container, following a pattern similar to the existing logic: https://github.com/apache/beam/blob/c9c57a765dbaae7960deef80c0471766b26636d6/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L466

To me, it's somewhat of an open question how we can add Anaconda support to other types of remote runners.

I received a tip from @yuvipanda, in a one-off discussion, that Mamba (https://mamba.readthedocs.io/en/latest/) might be a more useful tool to integrate with than `miniconda3` specifically. Mamba honors the same interfaces and files as `conda`, but it includes a faster implementation and dependency resolver.

### Issue Priority

Priority: 3

### Issue Component

Component: sdk-py-core
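The two steps in the Implementation Notes (a new option, then container selection based on it) could be sketched roughly as follows. This is a standalone illustration using `argparse`, not Beam's actual `SetupOptions` or `DataflowRunner` code. The flag name `--conda_environment` comes from the proposal above; the image names and the helpers `add_conda_argument`, `parse_options`, and `choose_container_image` are hypothetical placeholders.

```python
import argparse
from typing import List, Optional

# Placeholder image names for illustration only; the conda-enabled image
# is the subject of #22349 and is assumed, not an existing artifact.
DEFAULT_IMAGE = "beam-python-sdk:default"
CONDA_IMAGE = "beam-python-sdk:conda"


def add_conda_argument(parser: argparse.ArgumentParser) -> None:
    """Register the proposed flag, mirroring how setup options are declared."""
    parser.add_argument(
        "--conda_environment",
        default=None,
        help="Path to a conda environment.yml listing pipeline dependencies.",
    )


def parse_options(argv: List[str]) -> argparse.Namespace:
    """Parse only the flags this sketch knows about and ignore the rest."""
    parser = argparse.ArgumentParser()
    add_conda_argument(parser)
    known, _unknown = parser.parse_known_args(argv)
    return known


def choose_container_image(conda_environment: Optional[str]) -> str:
    """Pick the conda-enabled image only when an environment file was given."""
    if conda_environment is None:
        return DEFAULT_IMAGE
    if not conda_environment.endswith((".yml", ".yaml")):
        raise ValueError(f"expected a YAML file, got {conda_environment!r}")
    return CONDA_IMAGE


if __name__ == "__main__":
    opts = parse_options(
        ["--conda_environment", "environment.yml", "--runner", "DataflowRunner"]
    )
    print(choose_container_image(opts.conda_environment))
```

Keeping the image choice in a small pure function like this would make it easy for other remote runners to reuse the same decision, though how each runner provisions the image remains the open question noted above.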
