+1 on Aljoscha's comment; I'm not sure where the benefit is in having a "schematic" serialization.

I know that Spark, and I think Flink as well, uses Kryo <https://github.com/EsotericSoftware/kryo> for serialization (to be accurate, it's Chill <https://github.com/twitter/chill> for Spark), and I found it very impressive even compared to "manual" serializations, i.e., it seems to outperform Spark's "native" Encoders (1.6+) for primitives. In addition, it clearly supports Java and Scala, and there are third-party libraries for Clojure and Objective-C. I guess my bottom line here agrees with Kenneth - performance and interoperability - but I'm just not sure that schema-based serializers are *always* the fastest.

As for pipeline serialization, since performance is not the main issue there, and usability would be very important, I say +1 for JSON.

For anyone who has spent some time benchmarking serialization libraries, now is the time to speak up ;)

Thanks,
Amit
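For readers who have not used Kryo: a minimal Java round trip looks roughly like the sketch below. This is only an illustration of the library Amit mentions; the Event class is made up for the example, and Kryo, Output, and Input come from the com.esotericsoftware.kryo packages.

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    public class KryoRoundTrip {
      // Made-up value type, purely for illustration.
      public static class Event {
        long timestamp;
        String payload;
      }

      public static void main(String[] args) {
        Kryo kryo = new Kryo();
        // Registering the class lets Kryo write a small integer id instead of
        // the fully qualified class name, keeping the output compact.
        kryo.register(Event.class);

        Event event = new Event();
        event.timestamp = 1466121600000L;
        event.payload = "hello";

        // Serialize to an in-memory buffer.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        Output output = new Output(buffer);
        kryo.writeObject(output, event);
        output.close();

        // Deserialize from the same bytes.
        Input input = new Input(new ByteArrayInputStream(buffer.toByteArray()));
        Event copy = kryo.readObject(input, Event.class);
        input.close();

        System.out.println(copy.timestamp + " " + copy.payload);
      }
    }

Registration is a large part of why Kryo stays compact without carrying an external schema: the class is identified by a small integer rather than by name.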
On Fri, Jun 17, 2016 at 12:47 PM Aljoscha Krettek <aljos...@apache.org> wrote:

> Hi,
> am I correct in assuming that the transmitted envelopes would mostly
> contain coder-serialized values? If so, wouldn't the header of an envelope
> just be the number of contained bytes and the number of values? I'm
> probably missing something, but with these assumptions I don't see the
> benefit of using something like Avro/Thrift/Protobuf for serializing the
> main-input value envelopes. We would just need a system that can send byte
> data really fast between languages/VMs.
>
> By the way, another interesting question (at least for me) is how other
> data, such as side inputs, is going to arrive at the DoFn if we want to
> support a general interface for different languages.
>
> Cheers,
> Aljoscha
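To make the suggestion above concrete, here is a rough Java sketch of what such a minimal envelope could look like, assuming the header really is just an element count plus a byte count. The class and method names are hypothetical, not a proposed wire format.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.List;

    // Hypothetical framing: a header of (element count, payload length)
    // followed by the concatenated coder-encoded elements.
    public class EnvelopeSketch {
      public static byte[] encode(List<byte[]> coderEncodedElements) throws IOException {
        int payloadLength = 0;
        for (byte[] element : coderEncodedElements) {
          payloadLength += element.length;
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(coderEncodedElements.size()); // header: number of values
        out.writeInt(payloadLength);               // header: number of payload bytes
        for (byte[] element : coderEncodedElements) {
          out.write(element);                      // payload: coder output, back to back
        }
        out.flush();
        return buffer.toByteArray();
      }
    }

Note that decoding an envelope this thin relies on the coder itself being able to find element boundaries (or on adding a per-element length prefix to the payload), which is part of what keeps the header so small.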
> On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles <k...@google.com.invalid> wrote:
>
> > (Apologies for the formatting)
> >
> > On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles <k...@google.com> wrote:
> >
> > > Hello everyone!
> > >
> > > We are busily working on a Runner API (for building and transmitting
> > > pipelines) and a Fn API (for invoking user-defined functions found
> > > within pipelines), as outlined in the Beam technical vision [1]. Both
> > > of these require a language-independent serialization technology for
> > > interoperability between SDKs and runners.
> > >
> > > The Fn API includes a high-bandwidth data plane where bundles are
> > > transmitted via some serialization/RPC envelope (inside the envelope,
> > > the stream of elements is encoded with a coder) to transfer bundles
> > > between the runner and the SDK, so performance is extremely important.
> > > There are many choices for high-performance serialization, and we
> > > would like to start the conversation about what serialization
> > > technology is best for Beam.
> > >
> > > The goal of this discussion is to arrive at consensus on the question:
> > > What serialization technology should we use for the data plane
> > > envelope of the Fn API?
> > >
> > > To facilitate community discussion, we looked at the available
> > > technologies and tried to narrow the choices based on three criteria:
> > >
> > > - Performance: What is the size of the serialized data? How do we
> > >   expect the technology to affect pipeline speed and cost? etc.
> > >
> > > - Language support: Does the technology support the most widespread
> > >   languages for data processing? Does it have a vibrant ecosystem of
> > >   contributed language bindings? etc.
> > >
> > > - Community: What is the adoption of the technology? How mature is it?
> > >   How active is development? How is the documentation? etc.
> > >
> > > Given these criteria, we came up with four technologies that are good
> > > contenders. All have similar & adequate schema capabilities.
> > >
> > > - Apache Avro: Does not require code gen, but embedding the schema in
> > >   the data could be an issue. Very popular.
> > >
> > > - Apache Thrift: Probably a bit faster and more compact than Avro. A
> > >   huge number of languages supported.
> > >
> > > - Protocol Buffers 3: Incorporates the lessons that Google has learned
> > >   through long-term use of Protocol Buffers.
> > >
> > > - FlatBuffers: Some benchmarks imply great performance from the
> > >   zero-copy mmap idea. We would need to run representative experiments.
> > >
> > > I want to emphasize that this is a community decision, and this thread
> > > is just the conversation starter for us all to weigh in. We just wanted
> > > to do some legwork to focus the discussion if we could.
> > >
> > > And there's a minor follow-up question: once we settle here, is that
> > > technology also suitable for the low-bandwidth Runner API for defining
> > > pipelines, or does anyone think we need to consider a second technology
> > > (like JSON) for usability reasons?
> > >
> > > [1]
> > > https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
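On the JSON follow-up question, the usability argument is essentially that a pipeline description serialized as JSON stays human-readable and diffable. As a purely hypothetical illustration (the PipelineDescription POJO below is invented for this sketch and is not an actual Runner API type), a standard Jackson ObjectMapper can render such a description as pretty-printed JSON:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.Arrays;
    import java.util.List;

    public class PipelineJsonSketch {
      // Invented POJO standing in for whatever the Runner API pipeline model becomes.
      public static class PipelineDescription {
        public String name;
        public List<String> transforms;
      }

      public static void main(String[] args) throws Exception {
        PipelineDescription pipeline = new PipelineDescription();
        pipeline.name = "wordcount";
        pipeline.transforms = Arrays.asList("ReadLines", "CountWords", "WriteCounts");

        // Pretty-printed JSON is easy for a person to read and diff, which is
        // the usability argument for the low-bandwidth Runner API.
        ObjectMapper mapper = new ObjectMapper();
        System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(pipeline));
      }
    }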