Responded inline. On Sat, Jan 21, 2017 at 8:20 AM, Amit Sela <[email protected]> wrote:
> This is truly amazing Luke!
>
> If I understand this right, the runner executing the DoFn will delegate the
> function code and input data (and state, coders, etc.) to the container
> where it will execute with the user's SDK of choice, right?

Yes, that is correct.

> I wonder how the containers relate to the underlying engine's worker
> processes? Is it 1-1, container per worker? If there's less "work" for
> the worker's Java process (for example) now and it becomes a sort of
> "dispatcher", would that change the resource allocation commonly used for
> the same pipeline, so that the worker processes would require fewer
> resources, while giving those to the container?

I think with the four services (control, data, state, logging) you can go
with a 1-1 relationship, or break it up more finely and dedicate some
machines to specific tasks. For example, you could have a few machines
dedicated to log aggregation which all the workers push their logs to.
Similarly, you could have some machines with a lot of memory, better suited
to doing shuffles in memory, and this cluster of high-memory machines could
front the data service. There is a lot of flexibility based upon what a
runner can do and what it specializes in; with more effort come more
possibilities, albeit with increased internal complexity.

The layout of resources depends on whether the services and SDK containers
are co-hosted on the same machine or whether a different architecture is in
play. In a co-hosted configuration, it seems likely that the SDK container
will get more resources, but that depends on the runner and the pipeline
shape (shuffle-heavy pipelines will look different than ParDo-dominated
pipelines).

> About executing sub-graphs, would it be true to say that as long as there's
> no shuffle, you could keep executing in the same container? Meaning that
> the graph is broken into sub-graphs by shuffles?
The only thing that is required is that the Apache Beam model is preserved,
so typical break points will be at shuffles and language crossing points
(e.g. Python ParDo -> Java ParDo). A runner is free to break up the graph
even more for other reasons.

> I have to dig in deeper, so I could have more questions ;-) Thanks Luke!
>
> On Sat, Jan 21, 2017 at 1:52 AM Lukasz Cwik <[email protected]> wrote:
>
> > I updated the PR description to contain the same.
> >
> > I would start by looking at the API/object model definitions found in
> > beam_fn_api.proto
> > <https://github.com/lukecwik/incubator-beam/blob/fn_api/sdks/common/fn-api/src/main/proto/beam_fn_api.proto>
> >
> > Then depending on your interest, look at the following:
> > * FnHarness.java
> > <https://github.com/lukecwik/incubator-beam/blob/fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnHarness.java>
> > is the main entry point.
> > * org.apache.beam.fn.harness.data
> > <https://github.com/lukecwik/incubator-beam/tree/fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/data>
> > contains the most interesting bits of code, since it is able to multiplex
> > a gRPC stream into multiple logical streams of elements bound for
> > multiple concurrent process bundle requests. It also contains the code
> > to take multiple logical outbound streams and multiplex them back onto
> > a gRPC stream.
> > * org.apache.beam.runners.core
> > <https://github.com/lukecwik/incubator-beam/tree/fn_api/sdks/java/harness/src/main/java/org/apache/beam/runners/core>
> > contains additional runners akin to the DoFnRunner found in runners-core
> > to support sources and gRPC endpoints.
> >
> > Unless you're really interested in how domain sockets, epoll, nio channel
> > factories, or stream readiness callbacks work in gRPC, I would avoid the
> > packages org.apache.beam.fn.harness.channel and
> > org.apache.beam.fn.harness.stream.
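The logical-stream multiplexing described for org.apache.beam.fn.harness.data can be pictured with a small, gRPC-free sketch. All class and method names below are invented for illustration and are not the actual harness API: a single ordered stream of data messages is demultiplexed into one logical queue per concurrent process bundle request, keyed by instruction id.

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

/**
 * Editorial sketch (NOT the actual harness API): demultiplex one shared
 * physical stream of data messages into a logical stream per concurrent
 * ProcessBundle request, keyed by the request's instruction id.
 */
public class DataStreamDemuxSketch {

  /** A stand-in for one message on the shared gRPC data stream. */
  static final class Elements {
    final String instructionId;   // which ProcessBundle request this belongs to
    final byte[] encodedElements; // coder-encoded elements
    final boolean isLast;         // signals end of the logical stream

    Elements(String instructionId, byte[] encodedElements, boolean isLast) {
      this.instructionId = instructionId;
      this.encodedElements = encodedElements;
      this.isLast = isLast;
    }
  }

  private final Map<String, BlockingQueue<Elements>> logicalStreams =
      new ConcurrentHashMap<>();

  /** Called for each message arriving on the shared physical stream. */
  public void onNext(Elements message) {
    logicalStreams
        .computeIfAbsent(message.instructionId, id -> new LinkedBlockingQueue<>())
        .add(message);
  }

  /** A bundle processor drains only its own logical stream. */
  public BlockingQueue<Elements> logicalStreamFor(String instructionId) {
    return logicalStreams.computeIfAbsent(
        instructionId, id -> new LinkedBlockingQueue<>());
  }

  public static void main(String[] args) {
    DataStreamDemuxSketch demux = new DataStreamDemuxSketch();
    // Two bundles' elements interleaved on the same physical stream:
    demux.onNext(new Elements("bundle-1", new byte[] {1}, false));
    demux.onNext(new Elements("bundle-2", new byte[] {2}, false));
    demux.onNext(new Elements("bundle-1", new byte[] {3}, true));
    System.out.println(demux.logicalStreamFor("bundle-1").size()); // 2
    System.out.println(demux.logicalStreamFor("bundle-2").size()); // 1
  }
}
```

The outbound direction in the real harness is the mirror image: many logical streams are tagged with their instruction id and written back onto one shared gRPC stream.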
> > Similarly, I would avoid
> > org.apache.beam.fn.harness.fn and org.apache.beam.fn.harness.fake, as
> > they don't add anything meaningful to the API.
> >
> > Code package descriptions:
> >
> > org.apache.beam.fn.harness.FnHarness: main entry point
> > org.apache.beam.fn.harness.control: Control service client and
> > individual request handlers
> > org.apache.beam.fn.harness.data: Data service client and logical stream
> > multiplexing
> > org.apache.beam.runners.core: Additional runners akin to the DoFnRunner
> > found in runners-core to support sources and gRPC endpoints
> > org.apache.beam.fn.harness.logging: Logging client implementation and
> > JUL logging handler adapter
> > org.apache.beam.fn.harness.channel: gRPC channel management
> > org.apache.beam.fn.harness.stream: gRPC stream management
> > org.apache.beam.fn.harness.fn: Java 8 functional interface extensions
> >
> > On Fri, Jan 20, 2017 at 1:26 PM, Kenneth Knowles <[email protected]>
> > wrote:
> >
> > > This is awesome! Any chance you could roadmap the PR for us with some
> > > links into the most interesting bits?
> > >
> > > On Fri, Jan 20, 2017 at 12:19 PM, Robert Bradshaw <
> > > [email protected]> wrote:
> > >
> > > > Also, note that we can still support the "simple" case. For example,
> > > > if the user supplies us with a jar file (as they do now), a runner
> > > > could launch it as a subprocess and communicate with it via this
> > > > same Fn API, or install it in a fixed container itself--the user
> > > > doesn't *need* to know about docker or manually manage containers
> > > > (and indeed the Fn API could be used in-process, cross-process,
> > > > cross-container, and even cross-machine).
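Luke's earlier point about break points (break at shuffles and at language crossings such as Python ParDo -> Java ParDo, while a runner may break further) can be sketched as a simple fusion pass over a linear pipeline. Everything below, including the Transform shape and names, is an illustrative assumption, not Beam code:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/**
 * Editorial sketch of the break-point rule: fuse consecutive transforms
 * into one executable stage, but break at shuffles (e.g. GroupByKey) and
 * at SDK environment crossings. Names are illustrative, not the Beam API.
 */
public class StageFusionSketch {

  static final class Transform {
    final String name;
    final boolean isShuffle;  // e.g. a GroupByKey, executed by the runner
    final String environment; // e.g. "java-sdk" or "python-sdk" container

    Transform(String name, boolean isShuffle, String environment) {
      this.name = name;
      this.isShuffle = isShuffle;
      this.environment = environment;
    }
  }

  /** Greedily fuse a linear pipeline into stages, breaking where required. */
  static List<List<Transform>> fuse(List<Transform> pipeline) {
    List<List<Transform>> stages = new ArrayList<>();
    List<Transform> current = new ArrayList<>();
    for (Transform t : pipeline) {
      boolean environmentChanged =
          !current.isEmpty()
              && !current.get(current.size() - 1).environment.equals(t.environment);
      if ((t.isShuffle || environmentChanged) && !current.isEmpty()) {
        stages.add(current);           // required break point
        current = new ArrayList<>();
      }
      current.add(t);
      if (t.isShuffle) {               // a shuffle forms its own stage
        stages.add(current);
        current = new ArrayList<>();
      }
    }
    if (!current.isEmpty()) {
      stages.add(current);
    }
    return stages;
  }

  public static void main(String[] args) {
    List<Transform> pipeline =
        Arrays.asList(
            new Transform("ReadLines", false, "java-sdk"),
            new Transform("ExtractWords", false, "java-sdk"),
            new Transform("GroupByKey", true, "runner"),
            new Transform("CountPerKey", false, "java-sdk"),
            new Transform("FormatPython", false, "python-sdk"));
    // [ReadLines, ExtractWords] | [GroupByKey] | [CountPerKey] | [FormatPython]
    System.out.println(fuse(pipeline).size()); // 4
  }
}
```

As the thread notes, a runner is free to break more aggressively than this; the sketch only shows the minimum breaks the model requires.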
> > > > However, docker provides a nice cross-language way of specifying the
> > > > environment including all dependencies (especially for languages like
> > > > Python or C where the equivalent of a cross-platform, self-contained
> > > > jar isn't as easy to produce) and is strictly more powerful and
> > > > flexible (specifically, it isolates the runtime environment and one
> > > > can even use it for local testing).
> > > >
> > > > Slicing a worker up like this without sacrificing performance is an
> > > > ambitious goal, but essential to the story of being able to mix and
> > > > match runners and SDKs arbitrarily, and I think this is a great
> > > > start.
> > > >
> > > > On Fri, Jan 20, 2017 at 9:39 AM, Lukasz Cwik <[email protected]>
> > > > wrote:
> > > > > You're correct: a docker container is created that contains the
> > > > > execution environment the user wants, or the user re-uses an
> > > > > existing one (allowing a user to embed all their code/dependencies,
> > > > > or to use a container that can deploy code/dependencies on demand).
> > > > > A user creates a pipeline saying which docker container they want
> > > > > to use (this starts to allow for multiple container definitions
> > > > > within a single pipeline, to support multiple languages,
> > > > > versioning, ...).
> > > > > A runner would then be responsible for launching one or more of
> > > > > these containers in a cluster manager of their choice (scaling the
> > > > > number of instances up or down depending on demand/load/...).
> > > > > A runner then interacts with the docker containers over the gRPC
> > > > > service definitions to delegate processing to them.
> > > > >
> > > > > On Fri, Jan 20, 2017 at 4:56 AM, Jean-Baptiste Onofré
> > > > > <[email protected]> wrote:
> > > > >
> > > > >> Hi Luke,
> > > > >>
> > > > >> that's really great and very promising!
> > > > >>
> > > > >> It's really ambitious but I like the idea. Just to clarify: the
> > > > >> purpose of using gRPC is that once the docker container is
> > > > >> running, we can "interact" with the container to spread and
> > > > >> delegate processing to it, correct?
> > > > >> The users/devops have to set up the docker containers as a
> > > > >> prerequisite. Then, the "location" of the containers (a kind of
> > > > >> container registry) is set via the pipeline options and used by
> > > > >> gRPC?
> > > > >>
> > > > >> Thanks Luke!
> > > > >>
> > > > >> Regards
> > > > >> JB
> > > > >>
> > > > >> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
> > > > >>
> > > > >>> I have been prototyping several components towards the Beam
> > > > >>> technical vision of being able to execute an arbitrary language
> > > > >>> using an arbitrary runner.
> > > > >>>
> > > > >>> I would like to share this overview [1] of what I have been
> > > > >>> working towards. I also share this PR [2] with a proposed API,
> > > > >>> service definitions, and a partial implementation.
> > > > >>>
> > > > >>> 1: https://s.apache.org/beam-fn-api
> > > > >>> 2: https://github.com/apache/beam/pull/1801
> > > > >>>
> > > > >>> Please comment on the overview within this thread, and leave any
> > > > >>> specific code comments on the PR directly.
> > > > >>>
> > > > >>> Luke
> > > > >>
> > > > >> --
> > > > >> Jean-Baptiste Onofré
> > > > >> [email protected]
> > > > >> http://blog.nanthrax.net
> > > > >> Talend - http://www.talend.com
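As a closing illustration of the container-per-environment idea discussed in this thread (a pipeline names the docker container for each SDK environment; the runner resolves and launches it), here is a minimal sketch. The registry class, method names, and image names are all hypothetical, not Beam's actual pipeline-options API:

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Editorial sketch: the pipeline records which container image each SDK
 * environment uses; a runner looks the image up before asking its cluster
 * manager to launch (and scale) worker containers for that environment.
 */
public class EnvironmentRegistrySketch {

  private final Map<String, String> environmentToImage = new HashMap<>();

  /** Register an environment, e.g. derived from pipeline options. */
  public void register(String environmentId, String containerImage) {
    environmentToImage.put(environmentId, containerImage);
  }

  /** A runner resolves the image for an environment before launching it. */
  public String imageFor(String environmentId) {
    String image = environmentToImage.get(environmentId);
    if (image == null) {
      throw new IllegalArgumentException("Unknown environment: " + environmentId);
    }
    return image;
  }

  public static void main(String[] args) {
    EnvironmentRegistrySketch registry = new EnvironmentRegistrySketch();
    // Multiple container definitions within one pipeline, as in the thread:
    registry.register("java-sdk", "example.registry/beam-java:dev");
    registry.register("python-sdk", "example.registry/beam-python:dev");
    System.out.println(registry.imageFor("python-sdk"));
  }
}
```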
