Hello everyone! We are busily working on a Runner API (for building and transmitting pipelines) and a Fn API (for invoking user-defined functions found within pipelines) as outlined in the Beam technical vision [1]. Both of these require a language-independent serialization technology for interoperability between SDKs and runners.
The Fn API includes a high-bandwidth data plane where bundles are transmitted between the runner and the SDK via some serialization/RPC envelope (inside the envelope, the stream of elements is encoded with a coder -- see the sketch at the end of this message), so performance is extremely important. There are many choices for high-performance serialization, and we would like to start the conversation about which serialization technology is best for Beam.

The goal of this discussion is to reach consensus on the question: what serialization technology should we use for the data plane envelope of the Fn API?

To facilitate community discussion, we looked at the available technologies and tried to narrow the choices based on three criteria:

- Performance: What is the size of the serialized data? How do we expect the technology to affect pipeline speed and cost? etc.
- Language support: Does the technology support the most widely used languages for data processing? Does it have a vibrant ecosystem of contributed language bindings? etc.
- Community: How widely adopted is the technology? How mature is it? How active is development? How good is the documentation? etc.

Given these criteria, we came up with four technologies that are good contenders. All have similar and adequate schema capabilities.

- Apache Avro: Does not require code generation, but embedding the schema in the data could be an issue. Very popular. (The second sketch at the end of this message shows what the no-codegen path looks like.)
- Apache Thrift: Probably a bit faster and more compact than Avro. A huge number of languages supported.
- Protocol Buffers 3: Incorporates the lessons that Google has learned through long-term use of Protocol Buffers.
- FlatBuffers: Some benchmarks suggest great performance from its zero-copy, mmap-friendly design. We would need to run representative experiments.

I want to emphasize that this is a community decision, and this thread is just the conversation starter for us all to weigh in. We just wanted to do some legwork to focus the discussion if we could.

And there's a minor follow-up question: once we settle here, is that technology also suitable for the low-bandwidth Runner API for defining pipelines, or does anyone think we need to consider a second technology (like JSON) for usability reasons?

[1] https://docs.google.com/presentation/d/1E9seGPB_VXtY_KZP4HngDPTbsu5RVZFFaTlwEYa88Zw/present?slide=id.g108d3a202f_0_38
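P.S. To make the envelope/coder layering concrete, here is a minimal Python sketch. It is purely illustrative: the length-prefix framing, the function names, and the instruction_id field are assumptions invented for this example, not anything proposed for Beam. A real envelope would be a message defined in whichever technology we pick, and the coder-encoded element stream would be carried inside it as opaque bytes.

    # Hypothetical sketch of the data plane layering: a "coder" encodes each
    # element to bytes, the encoded stream is concatenated, and an "envelope"
    # wraps it for transport. None of these names are real Beam APIs.
    import struct
    from typing import Iterable, List, Tuple

    def encode_element(element: bytes) -> bytes:
        # Stand-in for a Beam coder: length-prefix the element's bytes.
        return struct.pack(">I", len(element)) + element

    def make_envelope(instruction_id: str, elements: Iterable[bytes]) -> bytes:
        # Wrap the coder-encoded element stream in a transport envelope.
        # In practice this would be an Avro/Thrift/proto3/FlatBuffers message.
        header = instruction_id.encode("utf-8")
        payload = b"".join(encode_element(e) for e in elements)
        return struct.pack(">I", len(header)) + header + payload

    def read_envelope(data: bytes) -> Tuple[str, List[bytes]]:
        # Inverse of make_envelope: recover the id and the element stream.
        (hlen,) = struct.unpack_from(">I", data, 0)
        instruction_id = data[4:4 + hlen].decode("utf-8")
        elements, pos = [], 4 + hlen
        while pos < len(data):
            (elen,) = struct.unpack_from(">I", data, pos)
            pos += 4
            elements.append(data[pos:pos + elen])
            pos += elen
        return instruction_id, elements

The only point of the sketch is that the envelope and the element encoding are separate layers: the envelope is what this thread is about, while the elements stay coder-encoded bytes regardless of which technology wins.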
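And since Avro's "no code gen" property came up above, here is a second sketch showing what that path looks like with the avro package from PyPI. The "Element" schema below is a throwaway example, not a proposed format. The writer is driven entirely by a schema parsed at runtime, which is also why the schema (or a reference to it) has to accompany the data for a reader to decode it -- the concern noted in the bullet.

    # Avro without code generation: serialization driven by a runtime-parsed
    # schema. The "Element" record here is a made-up example schema.
    import io
    import avro.schema
    from avro.io import BinaryEncoder, DatumWriter

    SCHEMA = avro.schema.parse("""
        {"type": "record", "name": "Element",
         "fields": [{"name": "payload", "type": "bytes"}]}
    """)

    buf = io.BytesIO()
    DatumWriter(SCHEMA).write({"payload": b"\x00\x01"}, BinaryEncoder(buf))
    # buf now holds the binary record; a reader needs the same schema
    # (or the writer's schema shipped alongside) to decode it.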
