Hello everyone!

We are busily working on a Runner API (for building and transmitting pipelines) and a Fn API (for invoking user-defined functions found within pipelines) as outlined in the Beam technical vision [1]. Both of these require a language-independent serialization technology for interoperability between SDKs and runners.

The Fn API includes a high-bandwidth data plane where bundles are transmitted between the runner and the SDK via some serialization/RPC envelope (inside the envelope, the stream of elements is encoded with a coder), so performance is extremely important. There are many choices for high-performance serialization, and we would like to start the conversation about what serialization technology is best for Beam.
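
To make the envelope/coder split concrete, here is a minimal, hypothetical Java sketch (the Envelope class and the length-prefixed string coder are invented for illustration, not taken from the actual Fn API). The point is that the envelope technology only ever carries opaque bytes; the elements inside are the coder's business:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.List;

    public class EnvelopeSketch {

      // Hypothetical stand-in for whatever the chosen serialization
      // technology would generate: an id plus an opaque payload.
      static final class Envelope {
        final String instructionId; // correlates the bundle with a request
        final byte[] payload;       // coder-encoded elements, opaque here

        Envelope(String instructionId, byte[] payload) {
          this.instructionId = instructionId;
          this.payload = payload;
        }
      }

      // A trivial "coder": length-prefixed UTF-8 strings. A real coder
      // would be the SDK's element encoding, invisible to the envelope.
      static byte[] encodeElements(List<String> elements) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        for (String element : elements) {
          byte[] encoded = element.getBytes(StandardCharsets.UTF_8);
          out.writeInt(encoded.length);
          out.write(encoded);
        }
        return bytes.toByteArray();
      }

      public static void main(String[] args) throws IOException {
        byte[] payload = encodeElements(List.of("a", "bb", "ccc"));
        Envelope envelope = new Envelope("bundle-1", payload);
        System.out.println("payload bytes: " + envelope.payload.length);
      }
    }

Since the elements are already coder-encoded bytes by the time they reach the envelope, what we are choosing here is mostly the framing around them.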

The goal of this discussion is to arrive at consensus on the question: What serialization technology should we use for the data plane envelope of the Fn API?

To facilitate community discussion, we looked at the available technologies and tried to narrow the choices based on three criteria:

 - Performance: What is the size of serialized data? How do we expect the
   technology to affect pipeline speed and cost? etc

 - Language support: Does the technology support the most widespread
   languages for data processing? Does it have a vibrant ecosystem of
   contributed language bindings? etc

 - Community: What is the adoption of the technology? How mature is it? How
   active is development? How is the documentation? etc

Given these criteria, we came up with four technologies that are good
contenders. All have similar & adequate schema capabilities.

 - Apache Avro: Does not require code gen, but embedding the schema in the
   data could be an issue. Very popular. (A minimal encoding sketch follows
   this list.)

 - Apache Thrift: Probably a bit faster and more compact than Avro. A huge
   number of languages supported.

 - Protocol Buffers 3: Incorporates the lessons that Google has learned
   through long-term use of Protocol Buffers.

 - FlatBuffers: Some benchmarks imply great performance from the zero-copy
   mmap idea. We would need to run representative experiments.
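
To illustrate the "no code gen" point in the Avro item above, here is a minimal Java sketch using Avro's GenericRecord API. The record schema and field names are invented for illustration and are not a proposed Beam format:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.ByteBuffer;
    import org.apache.avro.Schema;
    import org.apache.avro.generic.GenericData;
    import org.apache.avro.generic.GenericDatumWriter;
    import org.apache.avro.generic.GenericRecord;
    import org.apache.avro.io.BinaryEncoder;
    import org.apache.avro.io.EncoderFactory;

    public class AvroSketch {
      public static void main(String[] args) throws IOException {
        // The schema is defined at runtime; no generated classes needed.
        Schema schema = new Schema.Parser().parse(
            "{\"type\":\"record\",\"name\":\"Element\",\"fields\":["
                + "{\"name\":\"key\",\"type\":\"string\"},"
                + "{\"name\":\"value\",\"type\":\"bytes\"}]}");

        GenericRecord record = new GenericData.Record(schema);
        record.put("key", "hello");
        record.put("value", ByteBuffer.wrap(new byte[] {1, 2, 3}));

        // Raw binary encoding writes only the data; the reader must get
        // the schema some other way (Avro data files embed it, which is
        // the "embedding the schema" concern mentioned above).
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        BinaryEncoder encoder = EncoderFactory.get().binaryEncoder(out, null);
        new GenericDatumWriter<GenericRecord>(schema).write(record, encoder);
        encoder.flush();
        System.out.println("encoded bytes: " + out.size());
      }
    }

Thrift and Protocol Buffers would typically rely on generated classes for the equivalent, which is the trade-off the list above is pointing at.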

I want to emphasize that this is a community decision, and this thread is just the conversation starter for us all to weigh in. We simply wanted to do some legwork to focus the discussion if we could.

And there's a minor follow-up question: Once we settle here, is that technology also suitable for the low-bandwidth Runner API for defining pipelines, or does anyone think we need to consider a second technology (like JSON) for usability reasons?

[1] https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
