Hi,
am I correct in assuming that the transmitted envelopes would mostly contain coder-serialized values? If so, wouldn't the header of an envelope just be the number of contained bytes and the number of values? I'm probably missing something, but with these assumptions I don't see the benefit of using something like Avro/Thrift/Protobuf for serializing the main-input value envelopes. We would just need a system that can send byte data really fast between languages/VMs.
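To make that concrete, the kind of framing I have in mind is roughly the following. This is just a sketch, not a proposal; all names are made up, and it assumes the coder makes each element self-delimiting:

  // Hypothetical sketch of a minimal envelope around coder-encoded elements,
  // independent of Avro/Thrift/Protobuf. Header = byte count + value count.
  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;
  import java.io.IOException;
  import java.util.List;

  public class MinimalEnvelope {

    /**
     * Frames a bundle of already coder-encoded elements as:
     *   [int32 payloadBytes][int32 elementCount][payload bytes...]
     */
    public static byte[] encode(List<byte[]> encodedElements) throws IOException {
      ByteArrayOutputStream payload = new ByteArrayOutputStream();
      for (byte[] element : encodedElements) {
        // Elements are assumed to be self-delimiting via their coder,
        // so they can simply be concatenated.
        payload.write(element);
      }
      byte[] payloadBytes = payload.toByteArray();

      ByteArrayOutputStream out = new ByteArrayOutputStream();
      DataOutputStream header = new DataOutputStream(out);
      header.writeInt(payloadBytes.length);    // number of contained bytes
      header.writeInt(encodedElements.size()); // number of contained values
      header.write(payloadBytes);
      header.flush();
      return out.toByteArray();
    }
  }

With a header that small, the choice of envelope format hardly seems to matter for performance; what matters is how quickly the payload bytes can be pushed across the language/VM boundary.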
By the way, another interesting question (at least for me) is how other data, such as side-inputs, is going to arrive at the DoFn if we want to support a general interface for different languages.

Cheers,
Aljoscha

On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles <[email protected]> wrote:

> (Apologies for the formatting)
>
> On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles <[email protected]> wrote:
>
> > Hello everyone!
> >
> > We are busily working on a Runner API (for building and transmitting pipelines) and a Fn API (for invoking user-defined functions found within pipelines), as outlined in the Beam technical vision [1]. Both of these require a language-independent serialization technology for interoperability between SDKs and runners.
> >
> > The Fn API includes a high-bandwidth data plane where bundles are transmitted via some serialization/RPC envelope (inside the envelope, the stream of elements is encoded with a coder) to transfer bundles between the runner and the SDK, so performance is extremely important. There are many choices for high-performance serialization, and we would like to start the conversation about what serialization technology is best for Beam.
> >
> > The goal of this discussion is to arrive at consensus on the question: What serialization technology should we use for the data plane envelope of the Fn API?
> >
> > To facilitate community discussion, we looked at the available technologies and tried to narrow the choices based on three criteria:
> >
> > - Performance: What is the size of the serialized data? How do we expect the technology to affect pipeline speed and cost? etc.
> >
> > - Language support: Does the technology support the most widespread languages for data processing? Does it have a vibrant ecosystem of contributed language bindings? etc.
> >
> > - Community: What is the adoption of the technology? How mature is it? How active is development? How is the documentation? etc.
> >
> > Given these criteria, we came up with four technologies that are good contenders. All have similar & adequate schema capabilities.
> >
> > - Apache Avro: Does not require code gen, but embedding the schema in the data could be an issue. Very popular.
> >
> > - Apache Thrift: Probably a bit faster and more compact than Avro. A huge number of languages supported.
> >
> > - Protocol Buffers 3: Incorporates the lessons that Google has learned through long-term use of Protocol Buffers.
> >
> > - FlatBuffers: Some benchmarks imply great performance from the zero-copy mmap idea. We would need to run representative experiments.
> >
> > I want to emphasize that this is a community decision, and this thread is just the conversation starter for us all to weigh in. We just wanted to do some legwork to focus the discussion if we could.
> >
> > And there's a minor follow-up question: Once we settle here, is that technology also suitable for the low-bandwidth Runner API for defining pipelines, or does anyone think we need to consider a second technology (like JSON) for usability reasons?
> >
> > [1] https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
