alxmrs opened a new issue, #22675:
URL: https://github.com/apache/beam/issues/22675

   ### What would you like to happen?
   
   I'd like to offer an extension to my feature request in #22349 (for `conda` 
docker support in the Python Beam SDK Docker image). Given that scientific 
software is most often distributed via the Anaconda package manager (typically, 
from `conda-forge`), I propose that users of the Python Beam SDK be allowed to 
add Python dependencies via an `environment.yml` file. I'm imagining something 
like: 
   ```
   python era5_climatology.py --conda_environment environment.yml --runner DataflowRunner  # etc...
   ```
   
   After the user specifies the environment, the remote Beam runner should set 
up a Docker image with the Anaconda package manager, and install all of the 
dependencies expressed in the `environment.yml` file in the global runtime 
environment. These packages should be useable from each step in the pipeline.
   
   I anticipate that such a feature would be really valuable to members of the 
scientific Python community, who are more familiar with Anaconda environments 
than with Docker. Indeed, this could drastically simplify setting up dependencies 
for Python users, saving them from compiling scientific packages in Docker or 
debugging `pip` and `setuptools`. 
   
   There is a tradeoff, however: a prebuilt Docker image will offer faster start 
times than installing dependencies at runtime. I see this feature fitting alongside 
the `pip`, `tar` and `setup.py` approaches that already exist for [managing Python 
dependencies](https://beam.apache.org/documentation/sdks/python-pipeline-dependencies/).
   
   This feature would depend on the existence of a standard Anaconda SDK image 
(see the connected issue above, #22349).
   
   ## Implementation Notes
   
   The entry point for this feature would be an additional argument in 
`SetupOptions`:
   
   
https://github.com/apache/beam/blob/c9c57a765dbaae7960deef80c0471766b26636d6/sdks/python/apache_beam/options/pipeline_options.py#L1105
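
As a rough illustration of the proposed argument, here is a minimal sketch that uses plain `argparse` rather than Beam itself (Beam option classes register their flags on an `argparse` parser through `_add_argparse_args`). The flag name `--conda_environment` comes from the example command above. The helper name `add_conda_argument` is hypothetical and does not exist in Beam.

```python
import argparse


def add_conda_argument(parser: argparse.ArgumentParser) -> None:
    """Register the proposed flag (hypothetical; not part of Beam today)."""
    parser.add_argument(
        "--conda_environment",
        default=None,
        help="Path to a conda environment.yml file listing pipeline dependencies.",
    )


parser = argparse.ArgumentParser()
add_conda_argument(parser)

# parse_known_args leaves unrelated flags (such as --runner) untouched,
# so other option groups can still consume them.
known, remaining = parser.parse_known_args(
    ["--conda_environment", "environment.yml", "--runner", "DataflowRunner"]
)
```

With this input, `known.conda_environment` holds the path to the file, and the runner flags are passed through unchanged in `remaining`.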
   
   For the `DataflowRunner` specifically, we could check whether an 
`environment.yml` file was passed in and, if so, choose the `conda`-enabled Python 
container, following a pattern similar to the existing logic: 
   
   
https://github.com/apache/beam/blob/c9c57a765dbaae7960deef80c0471766b26636d6/sdks/python/apache_beam/runners/dataflow/dataflow_runner.py#L466
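
The selection step could look roughly like the sketch below. It is not the existing `DataflowRunner` logic: the function name, both image names, and the rule that an explicitly supplied image takes precedence are all assumptions made for illustration.

```python
from typing import Optional

# Placeholder image names (assumptions); no such conda image exists yet.
DEFAULT_IMAGE = "example.invalid/beam_python_sdk:latest"
CONDA_IMAGE = "example.invalid/beam_python_conda_sdk:latest"


def choose_sdk_image(
    conda_environment: Optional[str],
    user_image: Optional[str] = None,
) -> str:
    """Pick a worker container image for the pipeline (hypothetical helper)."""
    if user_image:
        # Assumption: a container the user supplied explicitly always wins.
        return user_image
    if conda_environment:
        # An environment.yml was passed in, so use the conda-enabled image.
        return CONDA_IMAGE
    return DEFAULT_IMAGE
```

Keeping the choice in one small function would let other remote runners reuse the same rule if they later gain Anaconda support.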
   
   To me, it is still an open question how we could add Anaconda support to 
other types of remote runners. 
   
   In a one-off discussion, @yuvipanda suggested that Mamba 
(https://mamba.readthedocs.io/en/latest/) might be a better tool to integrate 
with than `miniconda3`. Mamba honors the same interfaces and environment files 
as `conda`, but includes a faster implementation and dependency 
resolver. 
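
To make the runtime install step concrete, here is a minimal sketch of the command a worker might run to install the `environment.yml` dependencies into the global (`base`) environment. It uses `conda env update --name base --file ...`. That `mamba` accepts the identical subcommand is an assumption based on the compatibility described above. The helper name `build_install_command` is hypothetical.

```python
import shlex
from typing import List


def build_install_command(environment_file: str, tool: str = "conda") -> List[str]:
    """Build the install command for the base environment (hypothetical helper)."""
    if tool not in ("conda", "mamba"):
        raise ValueError(f"unsupported tool: {tool!r}")
    # Assumption: mamba can be swapped in for conda with the same arguments.
    return [tool, "env", "update", "--name", "base", "--file", environment_file]


command = build_install_command("environment.yml", tool="mamba")
# A worker could hand `command` to subprocess.run; here it is only printed.
print(shlex.join(command))
```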
   
   ### Issue Priority
   
   Priority: 3
   
   ### Issue Component
   
   Component: sdk-py-core

