+1 on Aljoscha's comment; I'm not sure where the benefit is in having a "schematic" serialization.

I know that Spark, and I think Flink as well, uses Kryo <https://github.com/EsotericSoftware/kryo> for serialization (to be accurate, it's Chill <https://github.com/twitter/chill> for Spark), and I found it very impressive even compared to "manual" serializations, i.e., it seems to outperform Spark's "native" Encoders (1.6+) for primitives. In addition, it clearly supports Java and Scala, and there are third-party libraries for Clojure and Objective-C. I guess my bottom line here agrees with Kenneth - performance and interoperability - but I'm just not sure that schema-based serializers are *always* the fastest.

As for pipeline serialization, since performance is not the main issue there, and usability would be very important, I say +1 for JSON.

For anyone who has spent some time benchmarking serialization libraries, now is the time to speak up ;)

Thanks,
Amit
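For readers who have not used Kryo: a minimal Java round trip looks roughly like the sketch below. This is only an illustration of the library Amit mentions; the Event class is made up for the example, and Kryo, Output, and Input come from the com.esotericsoftware.kryo packages.

    import com.esotericsoftware.kryo.Kryo;
    import com.esotericsoftware.kryo.io.Input;
    import com.esotericsoftware.kryo.io.Output;
    import java.io.ByteArrayInputStream;
    import java.io.ByteArrayOutputStream;

    public class KryoRoundTrip {
      // Made-up value type, purely for illustration.
      public static class Event {
        long timestamp;
        String payload;
      }

      public static void main(String[] args) {
        Kryo kryo = new Kryo();
        // Registering the class lets Kryo write a small integer id instead of
        // the fully qualified class name, keeping the output compact.
        kryo.register(Event.class);

        Event event = new Event();
        event.timestamp = 1466121600000L;
        event.payload = "hello";

        // Serialize to an in-memory buffer.
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        Output output = new Output(buffer);
        kryo.writeObject(output, event);
        output.close();

        // Deserialize from the same bytes.
        Input input = new Input(new ByteArrayInputStream(buffer.toByteArray()));
        Event copy = kryo.readObject(input, Event.class);
        input.close();

        System.out.println(copy.timestamp + " " + copy.payload);
      }
    }

Registration is a large part of why Kryo stays compact without carrying an external schema: the class is identified by a small integer rather than by name.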
On Fri, Jun 17, 2016 at 12:47 PM Aljoscha Krettek <aljos...@apache.org> wrote:

> Hi,
> am I correct in assuming that the transmitted envelopes would mostly
> contain coder-serialized values? If so, wouldn't the header of an envelope
> just be the number of contained bytes and the number of values? I'm
> probably missing something, but with these assumptions I don't see the
> benefit of using something like Avro/Thrift/Protobuf for serializing the
> main-input value envelopes. We would just need a system that can send byte
> data really fast between languages/VMs.
>
> By the way, another interesting question (at least for me) is how other
> data, such as side inputs, is going to arrive at the DoFn if we want to
> support a general interface for different languages.
>
> Cheers,
> Aljoscha
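To make the suggestion above concrete, here is a rough Java sketch of what such a minimal envelope could look like, assuming the header really is just an element count plus a byte count. The class and method names are hypothetical, not a proposed wire format.

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;
    import java.util.List;

    // Hypothetical framing: a header of (element count, payload length)
    // followed by the concatenated coder-encoded elements.
    public class EnvelopeSketch {
      public static byte[] encode(List<byte[]> coderEncodedElements) throws IOException {
        int payloadLength = 0;
        for (byte[] element : coderEncodedElements) {
          payloadLength += element.length;
        }
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buffer);
        out.writeInt(coderEncodedElements.size()); // header: number of values
        out.writeInt(payloadLength);               // header: number of payload bytes
        for (byte[] element : coderEncodedElements) {
          out.write(element);                      // payload: coder output, back to back
        }
        out.flush();
        return buffer.toByteArray();
      }
    }

Note that decoding an envelope this thin relies on the coder itself being able to find element boundaries (or on adding a per-element length prefix to the payload), which is part of what keeps the header so small.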
> On Thu, 16 Jun 2016 at 22:33 Kenneth Knowles <k...@google.com.invalid> wrote:
>
> > (Apologies for the formatting)
> >
> > On Thu, Jun 16, 2016 at 12:12 PM, Kenneth Knowles <k...@google.com> wrote:
> >
> > > Hello everyone!
> > >
> > > We are busily working on a Runner API (for building and transmitting
> > > pipelines) and a Fn API (for invoking user-defined functions found
> > > within pipelines), as outlined in the Beam technical vision [1]. Both
> > > of these require a language-independent serialization technology for
> > > interoperability between SDKs and runners.
> > >
> > > The Fn API includes a high-bandwidth data plane where bundles are
> > > transmitted via some serialization/RPC envelope (inside the envelope,
> > > the stream of elements is encoded with a coder) to transfer bundles
> > > between the runner and the SDK, so performance is extremely important.
> > > There are many choices for high-performance serialization, and we
> > > would like to start the conversation about what serialization
> > > technology is best for Beam.
> > >
> > > The goal of this discussion is to arrive at consensus on the question:
> > > What serialization technology should we use for the data plane
> > > envelope of the Fn API?
> > >
> > > To facilitate community discussion, we looked at the available
> > > technologies and tried to narrow the choices based on three criteria:
> > >
> > > - Performance: What is the size of the serialized data? How do we
> > >   expect the technology to affect pipeline speed and cost? etc.
> > >
> > > - Language support: Does the technology support the most widespread
> > >   languages for data processing? Does it have a vibrant ecosystem of
> > >   contributed language bindings? etc.
> > >
> > > - Community: What is the adoption of the technology? How mature is it?
> > >   How active is development? How is the documentation? etc.
> > >
> > > Given these criteria, we came up with four technologies that are good
> > > contenders. All have similar & adequate schema capabilities.
> > >
> > > - Apache Avro: Does not require code gen, but embedding the schema in
> > >   the data could be an issue. Very popular.
> > >
> > > - Apache Thrift: Probably a bit faster and more compact than Avro. A
> > >   huge number of languages supported.
> > >
> > > - Protocol Buffers 3: Incorporates the lessons that Google has learned
> > >   through long-term use of Protocol Buffers.
> > >
> > > - FlatBuffers: Some benchmarks imply great performance from the
> > >   zero-copy mmap idea. We would need to run representative experiments.
> > >
> > > I want to emphasize that this is a community decision, and this thread
> > > is just the conversation starter for us all to weigh in. We just wanted
> > > to do some legwork to focus the discussion if we could.
> > >
> > > And there's a minor follow-up question: once we settle here, is that
> > > technology also suitable for the low-bandwidth Runner API for defining
> > > pipelines, or does anyone think we need to consider a second technology
> > > (like JSON) for usability reasons?
> > >
> > > [1]
> > > https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
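On the JSON follow-up question, the usability argument is essentially that a pipeline description serialized as JSON stays human-readable and diffable. As a purely hypothetical illustration (the PipelineDescription POJO below is invented for this sketch and is not an actual Runner API type), a standard Jackson ObjectMapper can render such a description as pretty-printed JSON:

    import com.fasterxml.jackson.databind.ObjectMapper;
    import java.util.Arrays;
    import java.util.List;

    public class PipelineJsonSketch {
      // Invented POJO standing in for whatever the Runner API pipeline model becomes.
      public static class PipelineDescription {
        public String name;
        public List<String> transforms;
      }

      public static void main(String[] args) throws Exception {
        PipelineDescription pipeline = new PipelineDescription();
        pipeline.name = "wordcount";
        pipeline.transforms = Arrays.asList("ReadLines", "CountWords", "WriteCounts");

        // Pretty-printed JSON is easy for a person to read and diff, which is
        // the usability argument for the low-bandwidth Runner API.
        ObjectMapper mapper = new ObjectMapper();
        System.out.println(mapper.writerWithDefaultPrettyPrinter().writeValueAsString(pipeline));
      }
    }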