Cool, thanks! It seems like there could be some good follow-ups to simplify things for Python users so they don’t have to roll their own Dockerfiles (for example, letting them provide a requirements.txt that is used in the Dockerfile) :)
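To make the requirements.txt idea a bit more concrete, here is a rough sketch of the kind of helper that could generate such a Dockerfile. Everything in it is hypothetical: the function name `render_dockerfile`, the base image tag, and the `/tmp` destination are placeholders for illustration, not anything that exists in the Beam Python SDK.

```python
from pathlib import Path


def render_dockerfile(base_image: str, requirements_path: str = "requirements.txt") -> str:
    """Return Dockerfile text that layers pip dependencies on top of base_image.

    Hypothetical helper: base_image is whatever SDK harness image the user
    starts from; requirements_path is relative to the Docker build context.
    """
    req_name = Path(requirements_path).name
    lines = [
        f"FROM {base_image}",
        # Copy the user's requirements file into the image.
        f"COPY {requirements_path} /tmp/{req_name}",
        # Install at image build time so workers do not install at runtime.
        f"RUN pip install --no-cache-dir -r /tmp/{req_name}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # "example/beam-python-sdk:latest" is a placeholder image name.
    print(render_dockerfile("example/beam-python-sdk:latest"), end="")
```

The generated text could then be built with a normal `docker build`, and the resulting image handed to the runner however the SDK ends up exposing that.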
I’m really excited about the direction with the containerized runners :)

On Sat, Nov 18, 2017 at 6:12 PM Henning Rohde <[email protected]> wrote:

> A benefit of using docker containers is that (nearly) arbitrary native
> dependencies can be installed in the container image itself by either the
> user or SDK. For example, the (minimal, in progress) Python container
> Dockerfile is here:
>
> https://github.com/apache/beam/blob/1039f5b9682fa6aa5fba256110c63caf4d0da41f/sdks/python/container/Dockerfile
>
> Any user could simply augment it with "pip install" commands, say, or use
> something else entirely (although the corresponding boot program may also
> need to change in that case). The Python SDK itself might also include
> options/scripts/etc to make common customizations easier to use to avoid
> installing them at runtime. Multiple Dockerfiles can also co-exist. For
> actually passing the container image to the runner, it's a choice made by
> each SDK, which is why it's not discussed much in the portability context.
> But a uniform flag along the lines of --sdk_harness_container_image to
> include the image into the pipeline proto would seem desirable. That said,
> I don't think how all these capabilities would best be exposed to users has
> been much explored yet in any SDK.
>
> Finally, there have been several thoughts on cross-language pipelines and I
> think it's a very exciting aspect of the portability framework. A doc is
> here:
>
> https://s.apache.org/beam-mixed-language-pipelines.
>
> It is also linked from the design section in the portability page.
>
> Thanks,
> Henning
>
>
> On Sat, Nov 18, 2017 at 6:33 AM, Holden Karau <[email protected]>
> wrote:
>
> > So I was looking through https://beam.apache.org/contribute/portability/
> > which led me to BEAM-2900, and then to
> > https://docs.google.com/document/d/1n6s3BOxOPct3uF4UgbbI9O9rpdiKWFH9R6mtVmR7xp0/edit#
> >
> > I was wondering if any consideration is being given to native
> > dependencies that user code may have (especially things like Python
> > packages, which can be super painful to deal with in a Spark cluster
> > unless you use one of the vendor solutions)?
> >
> > Also, and this may be a terrible idea, but has there been thought given
> > to the idea of cross-language pipelines (I see these in Spark
> > occasionally, but with the DL stuff happening I suspect we might see
> > users wanting cross-language functionality more often)?
> >
> > I also saw "Proposal: introduce an option to pass SDK harness container
> > image in Beam SDKs" & it seems like Robert brought up the benefits of
> > using Docker for Python runners, but I don't see the details on how we
> > would expose that to users in the design docs I've found yet (which
> > could very well be because I'm not looking at the right docs).
> >
> > Cheers,
> >
> > Holden :)
> >
> > --
> > Twitter: https://twitter.com/holdenkarau
>

--
Twitter: https://twitter.com/holdenkarau
