Awesome job, Lukasz! I have to confess that the first time I heard about the Fn API idea I was a bit incredulous, but you are making it real. Amazing!
Just one question from your document: you said that 80% of the extra (15%) time goes into encoding and decoding the data for your test case. Can you expand on your current ideas to improve this? (I am not sure I completely understand the issue.)

On Mon, Jan 23, 2017 at 7:10 PM, Lukasz Cwik <[email protected]> wrote:

> Responded inline.
>
> On Sat, Jan 21, 2017 at 8:20 AM, Amit Sela <[email protected]> wrote:
>
> > This is truly amazing Luke!
> >
> > If I understand this right, the runner executing the DoFn will delegate the function code and input data (and state, coders, etc.) to the container, where it will execute with the user's SDK of choice, right?
>
> Yes, that is correct.
>
> > I wonder how the containers relate to the underlying engine's worker processes? Is it 1-1, a container per worker? If there's less "work" for the worker's Java process (for example) now and it becomes a sort of "dispatcher", would that change the resource allocation commonly used for the same pipeline, so that the worker processes would require fewer resources while giving those to the container?
>
> I think with the four services (control, data, state, logging) you can go with a 1-1 relationship, or break it up more finely and dedicate some machines to specific tasks. For example, you could have a few machines dedicated to log aggregation which all the workers push their logs to. Similarly, you could have some machines with a lot of memory that are better able to do shuffles in memory, and this cluster of high-memory machines could front the data service. I believe there is a lot of flexibility based upon what a runner can do and what it specializes in, and that with more effort come more possibilities, albeit with increased internal complexity.
>
> The layout of resources depends on whether the services and SDK containers are co-hosted on the same machine or whether there is a different architecture in play. In a co-hosted configuration, it seems likely that the SDK container will get more resources, but that is dependent on the runner and pipeline shape (shuffle-heavy pipelines will look different than ParDo-dominated pipelines).
>
> > About executing sub-graphs, would it be true to say that as long as there's no shuffle, you could keep executing in the same container? Meaning that the graph is broken into sub-graphs by shuffles?
>
> The only thing that is required is that the Apache Beam model is preserved, so typical break points will be at shuffles and language crossing points (e.g. Python ParDo -> Java ParDo). A runner is free to break up the graph even more for other reasons.
>
> > I have to dig in deeper, so I could have more questions ;-) Thanks Luke!
> >
> > On Sat, Jan 21, 2017 at 1:52 AM Lukasz Cwik <[email protected]> wrote:
> >
> > > I updated the PR description to contain the same.
> > >
> > > I would start by looking at the API/object model definitions found in beam_fn_api.proto
> > > <https://github.com/lukecwik/incubator-beam/blob/fn_api/sdks/common/fn-api/src/main/proto/beam_fn_api.proto>
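To make my encoding/decoding question above concrete: as I understand it, every element crossing the Fn API data plane is coder-encoded to bytes before going onto the gRPC stream and decoded again on the other side. Here is a minimal sketch of that per-element round trip, assuming the Beam Java SDK's public CoderUtils helpers; the element type, key space, and element count are purely illustrative.

import org.apache.beam.sdk.coders.KvCoder;
import org.apache.beam.sdk.coders.StringUtf8Coder;
import org.apache.beam.sdk.coders.VarLongCoder;
import org.apache.beam.sdk.util.CoderUtils;
import org.apache.beam.sdk.values.KV;

/**
 * Illustrative only: approximates the per-element work added when elements
 * cross the Fn API data plane, since each element is coder-encoded to bytes
 * on one side of the gRPC stream and decoded again on the other.
 */
public class DataPlaneRoundTripSketch {
  public static void main(String[] args) throws Exception {
    KvCoder<String, Long> coder = KvCoder.of(StringUtf8Coder.of(), VarLongCoder.of());
    long start = System.nanoTime();
    for (long i = 0; i < 1_000_000; i++) {
      // Encode as it would be written onto the outbound data stream...
      byte[] wire = CoderUtils.encodeToByteArray(coder, KV.of("key-" + (i % 100), i));
      // ...and decode as the harness on the other side would read it back.
      KV<String, Long> roundTripped = CoderUtils.decodeFromByteArray(coder, wire);
      if (!roundTripped.getValue().equals(i)) {
        throw new AssertionError("round trip mismatch at element " + i);
      }
    }
    System.out.printf("1M KV<String, Long> round trips took %d ms%n",
        (System.nanoTime() - start) / 1_000_000);
  }
}

If that round trip is where most of the extra time goes, I assume the improvements would be along the lines of cheaper coders or batching more elements per data-plane message, but I'd love to hear your thinking.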
> > > Then depending on your interest, look at the following:
> > >
> > > * FnHarness.java
> > > <https://github.com/lukecwik/incubator-beam/blob/fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/FnHarness.java>
> > > is the main entry point.
> > > * org.apache.beam.fn.harness.data
> > > <https://github.com/lukecwik/incubator-beam/tree/fn_api/sdks/java/harness/src/main/java/org/apache/beam/fn/harness/data>
> > > contains the most interesting bits of code since it is able to multiplex a gRPC stream into multiple logical streams of elements bound for multiple concurrent process bundle requests. It also contains the code to take multiple logical outbound streams and multiplex them back onto a gRPC stream. [a rough sketch of this demultiplexing idea follows the package descriptions below]
> > > * org.apache.beam.runners.core
> > > <https://github.com/lukecwik/incubator-beam/tree/fn_api/sdks/java/harness/src/main/java/org/apache/beam/runners/core>
> > > contains additional runners akin to the DoFnRunner found in runners-core to support sources and gRPC endpoints.
> > >
> > > Unless you're really interested in how domain sockets, epoll, nio channel factories, or stream readiness callbacks work in gRPC, I would avoid the packages org.apache.beam.fn.harness.channel and org.apache.beam.fn.harness.stream. Similarly, I would avoid org.apache.beam.fn.harness.fn and org.apache.beam.fn.harness.fake as they don't add anything meaningful to the API.
> > >
> > > Code package descriptions:
> > >
> > > org.apache.beam.fn.harness.FnHarness: main entry point
> > > org.apache.beam.fn.harness.control: Control service client and individual request handlers
> > > org.apache.beam.fn.harness.data: Data service client and logical stream multiplexing
> > > org.apache.beam.runners.core: Additional runners akin to the DoFnRunner found in runners-core to support sources and gRPC endpoints
> > > org.apache.beam.fn.harness.logging: Logging client implementation and JUL logging handler adapter
> > > org.apache.beam.fn.harness.channel: gRPC channel management
> > > org.apache.beam.fn.harness.stream: gRPC stream management
> > > org.apache.beam.fn.harness.fn: Java 8 functional interface extensions
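As Luke notes above, org.apache.beam.fn.harness.data is where the most interesting bits are: one physical gRPC stream carries interleaved chunks of encoded elements for many concurrent process bundle requests, and the harness routes each chunk to the right logical stream. The following is a rough, purely illustrative sketch of that demultiplexing idea; it is not the actual harness code, and the class and field names are mine.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

/**
 * Illustrative sketch only: one physical gRPC data stream carries chunks of
 * coder-encoded elements tagged with the process bundle instruction they
 * belong to; the receiver demultiplexes each chunk to the logical consumer
 * registered for that instruction.
 */
public class LogicalStreamDemuxSketch {

  /** A chunk of encoded elements tagged with its owning bundle instruction. */
  public static final class DataChunk {
    final String instructionId;   // which process bundle request this belongs to
    final byte[] encodedElements; // elements, already encoded by the SDK's coder
    DataChunk(String instructionId, byte[] encodedElements) {
      this.instructionId = instructionId;
      this.encodedElements = encodedElements;
    }
  }

  private final Map<String, Consumer<byte[]>> consumers = new ConcurrentHashMap<>();

  /** Called when a bundle starts: registers the logical inbound stream for it. */
  public void register(String instructionId, Consumer<byte[]> consumer) {
    consumers.put(instructionId, consumer);
  }

  /** Called for every chunk arriving on the single physical gRPC stream. */
  public void accept(DataChunk chunk) {
    Consumer<byte[]> consumer = consumers.get(chunk.instructionId);
    if (consumer == null) {
      throw new IllegalStateException("No active bundle for " + chunk.instructionId);
    }
    consumer.accept(chunk.encodedElements);
  }

  /** Called when the bundle finishes: the logical stream is closed. */
  public void unregister(String instructionId) {
    consumers.remove(instructionId);
  }
}

The outbound direction would be the mirror image: each logical output stream tags its chunks with its instruction id before writing them onto the shared gRPC stream.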
> > > On Fri, Jan 20, 2017 at 1:26 PM, Kenneth Knowles <[email protected]> wrote:
> > >
> > > > This is awesome! Any chance you could roadmap the PR for us with some links into the most interesting bits?
> > > >
> > > > On Fri, Jan 20, 2017 at 12:19 PM, Robert Bradshaw <[email protected]> wrote:
> > > >
> > > > > Also, note that we can still support the "simple" case. For example, if the user supplies us with a jar file (as they do now), a runner could launch it as a subprocess and communicate with it via this same Fn API, or install it in a fixed container itself -- the user doesn't *need* to know about docker or manually manage containers (and indeed the Fn API could be used in-process, cross-process, cross-container, and even cross-machine).
> > > > >
> > > > > However, docker provides a nice cross-language way of specifying the environment including all dependencies (especially for languages like Python or C where the equivalent of a cross-platform, self-contained jar isn't as easy to produce) and is strictly more powerful and flexible (specifically, it isolates the runtime environment and one can even use it for local testing).
> > > > >
> > > > > Slicing a worker up like this without sacrificing performance is an ambitious goal, but essential to the story of being able to mix and match runners and SDKs arbitrarily, and I think this is a great start.
> > > > >
> > > > > On Fri, Jan 20, 2017 at 9:39 AM, Lukasz Cwik <[email protected]> wrote:
> > > > >
> > > > > > You're correct: a docker container is created that contains the execution environment the user wants, or the user re-uses an existing one (allowing a user to embed all their code/dependencies or use a container that can deploy code/dependencies on demand).
> > > > > > A user creates a pipeline saying which docker container they want to use (this starts to allow for multiple container definitions within a single pipeline to support multiple languages, versioning, ...).
> > > > > > A runner would then be responsible for launching one or more of these containers in a cluster manager of their choice (scaling the number of instances up or down depending on demand/load/...).
> > > > > > A runner then interacts with the docker containers over the gRPC service definitions to delegate processing to them.
> > > > > >
> > > > > > On Fri, Jan 20, 2017 at 4:56 AM, Jean-Baptiste Onofré <[email protected]> wrote:
> > > > > >
> > > > > >> Hi Luke,
> > > > > >>
> > > > > >> that's really great and very promising!
> > > > > >>
> > > > > >> It's really ambitious, but I like the idea. Just to clarify: the purpose of using gRPC is that once the docker container is running, we can "interact" with the container to spread and delegate processing to it, correct?
> > > > > >> The users/devops have to set up the docker containers as a prerequisite. Then, the "location" of the containers (a kind of container registry) is set via the pipeline options and used by gRPC?
> > > > > >>
> > > > > >> Thanks Luke!
> > > > > >>
> > > > > >> Regards
> > > > > >> JB
> > > > > >>
> > > > > >> On 01/19/2017 03:56 PM, Lukasz Cwik wrote:
> > > > > >>
> > > > > >>> I have been prototyping several components towards the Beam technical vision of being able to execute an arbitrary language using an arbitrary runner.
> > > > > >>>
> > > > > >>> I would like to share this overview [1] of what I have been working towards. I also share this PR [2] with a proposed API, service definitions and partial implementation.
> > > > > >>>
> > > > > >>> 1: https://s.apache.org/beam-fn-api
> > > > > >>> 2: https://github.com/apache/beam/pull/1801
> > > > > >>>
> > > > > >>> Please comment on the overview within this thread, and any specific code comments on the PR directly.
> > > > > >>>
> > > > > >>> Luke
> > > > > >>
> > > > > >> --
> > > > > >> Jean-Baptiste Onofré
> > > > > >> [email protected]
> > > > > >> http://blog.nanthrax.net
> > > > > >> Talend - http://www.talend.com
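One aside on JB's question above about where the container "location" lives: as Luke describes, the pipeline itself names the docker image(s), and the runner launches the containers and talks to them over the gRPC services. Purely as an illustration of what that could look like with the Beam Java SDK's PipelineOptions machinery (the option name and default image below are hypothetical, not an existing Beam option):

import org.apache.beam.sdk.options.Default;
import org.apache.beam.sdk.options.Description;
import org.apache.beam.sdk.options.PipelineOptions;
import org.apache.beam.sdk.options.PipelineOptionsFactory;

public class HarnessImageOptionSketch {

  /** Hypothetical options interface; the option name is illustrative only. */
  public interface HarnessEnvironmentOptions extends PipelineOptions {
    @Description("Docker image that provides the SDK harness execution environment")
    @Default.String("example.gcr.io/my-project/beam-java-harness:latest")
    String getHarnessContainerImage();
    void setHarnessContainerImage(String value);
  }

  public static void main(String[] args) {
    // Register the custom interface so --harnessContainerImage=... is recognized,
    // then read the value a runner would use when launching harness containers.
    PipelineOptionsFactory.register(HarnessEnvironmentOptions.class);
    HarnessEnvironmentOptions options =
        PipelineOptionsFactory.fromArgs(args).as(HarnessEnvironmentOptions.class);
    System.out.println("Runner would launch containers from: "
        + options.getHarnessContainerImage());
  }
}

A runner could then read that option when deciding which image to hand to its cluster manager of choice.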
