Hi,
am I correct in assuming that the transmitted envelopes would mostly
contain coder-serialized values? If so, wouldn't the header of an
envelope just consist of the number of contained bytes and the number of
values? I'm probably missing something, but under these assumptions I
don't see the benefit of using something like Avro/Thrift/Protobuf for
serializing the main-input value envelopes. We would just need a system
that can send byte data between languages/VMs really fast.
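
To make that assumption concrete, here is a rough sketch of the kind of
envelope I have in mind (all names are made up by me, this is not a
proposal): a tiny fixed header followed by the coder-encoded elements
written back to back.

import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.List;

// Hypothetical minimal envelope:
//   [number of contained bytes][number of values][coder-encoded elements...]
class MinimalEnvelope {
  static byte[] write(List<byte[]> coderEncodedElements) throws IOException {
    ByteArrayOutputStream payload = new ByteArrayOutputStream();
    for (byte[] element : coderEncodedElements) {
      payload.write(element);  // each element is already encoded by its coder
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    DataOutputStream header = new DataOutputStream(out);
    header.writeLong(payload.size());              // number of contained bytes
    header.writeInt(coderEncodedElements.size());  // number of values
    payload.writeTo(out);
    return out.toByteArray();
  }
}

If the envelope really is that simple, the choice of serialization
framework would only affect those few header bytes.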

By the way, another interesting question (at least for me) is how other
data, such as side inputs, is going to arrive at the DoFn if we want to
support a general interface across different languages.
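
Purely to illustrate the kind of interface I mean (all names here are
hypothetical, nothing like this exists yet), the SDK side might have to
ask the runner for side-input data through some language-neutral call
along these lines:

import java.io.IOException;

// Hypothetical, language-neutral call for a DoFn running in one language to
// fetch side-input data from the runner; names are invented for illustration.
interface SideInputClient {
  /**
   * Returns the coder-encoded content of one side input, restricted to the
   * window that the current main-input element falls into.
   */
  byte[] getSideInput(String sideInputId, byte[] encodedWindow) throws IOException;
}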

Cheers,
Aljoscha

On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles <[email protected]> wrote:

> (Apologies for the formatting)
>
> On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles <[email protected]> wrote:
>
> > Hello everyone!
> >
> > We are busily working on a Runner API (for building and transmitting
> > pipelines) and a Fn API (for invoking user-defined functions found within
> > pipelines) as outlined in the Beam technical vision [1]. Both of these
> > require a language-independent serialization technology for
> > interoperability between SDKs and runners.
> >
> > The Fn API includes a high-bandwidth data plane where bundles are
> > transmitted via some serialization/RPC envelope (inside the envelope, the
> > stream of elements is encoded with a coder) to transfer bundles between
> > the runner and the SDK, so performance is extremely important. There are
> > many choices for high performance serialization, and we would like to
> > start the conversation about what serialization technology is best for
> > Beam.
> >
> > The goal of this discussion is to arrive at consensus on the question:
> > What serialization technology should we use for the data plane envelope
> > of the Fn API?
> >
> > To facilitate community discussion, we looked at the available
> > technologies and tried to narrow the choices based on three criteria:
> >
> >  - Performance: What is the size of serialized data? How do we expect
> >    the technology to affect pipeline speed and cost? etc
> >
> >  - Language support: Does the technology support the most widespread
> >    languages for data processing? Does it have a vibrant ecosystem of
> >    contributed language bindings? etc
> >
> >  - Community: What is the adoption of the technology? How mature is it?
> >    How active is development? How is the documentation? etc
> >
> > Given these criteria, we came up with four technologies that are good
> > contenders. All have similar & adequate schema capabilities.
> >
> >  - Apache Avro: Does not require code gen, but embedding the schema in
> >    the data could be an issue. Very popular.
> >
> >  - Apache Thrift: Probably a bit faster and more compact than Avro. A
> >    huge number of languages supported.
> >
> >  - Protocol Buffers 3: Incorporates the lessons that Google has learned
> >    through long-term use of Protocol Buffers.
> >
> >  - FlatBuffers: Some benchmarks imply great performance from the
> >    zero-copy mmap idea. We would need to run representative experiments.
> >
> > I want to emphasize that this is a community decision, and this thread
> > is just the conversation starter for us all to weigh in. We just wanted
> > to do some legwork to focus the discussion if we could.
> >
> > And there's a minor follow-up question: Once we settle here, is that
> > technology also suitable for the low-bandwidth Runner API for defining
> > pipelines, or does anyone think we need to consider a second technology
> > (like JSON) for usability reasons?
> >
> > [1]
> > https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
