[
https://issues.apache.org/jira/browse/FLINK-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233527#comment-17233527
]
Suhan Mao commented on FLINK-18202:
-----------------------------------
I have some experience and already implemented the code of converting between
protobuf and row. Both of below two features are already online and serve
millions of QPS now. We have implemented using proto2 but it is not a difficult
thing to migrate to proto3.
# protobuf -> row。According to benchao's benchmark result, java native API is
over 4 times faster than dynamic message. It is also the case we met in our
real environment, so we use janino codegen way to convert pb to flink row
according to the pb Descriptor and required schema. We have already finished
the implementation and it is very close to the native API performance.We
support all kinds of complex types in pb including nested message or map.
# row -> protobuf. In our test, the java native API to serialize is still not
fast enough. Because in the whole process, we need to access all columns 3
times. (1) getter setter method of builder, (2) build() to copy the fields (3)
toBinaryArray() to read all fields again. So we use a more efficient way to
serialize the row. We just read each field in a row once and then directly
flush with com.google.protobuf.CodedOutputStream. It is 3 times faster than
java native API. The implementation use some native and basic API of protobuf.
Below is my benchmark result:
*10M rows to convert from pb to row*
Json : 105s
DynamicMessage: 166s
Code gen: 40s
*10M rows to convert from row to pb*
json: 106s
DynamicMessage: 69s
CodedOutputStream: 23s
> Introduce Protobuf format
> -------------------------
>
> Key: FLINK-18202
> URL: https://issues.apache.org/jira/browse/FLINK-18202
> Project: Flink
> Issue Type: New Feature
> Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Table
> SQL / API
> Reporter: Benchao Li
> Priority: Major
> Attachments: image-2020-06-15-17-18-03-182.png
>
>
> PB[1] is a very famous and wildly used (de)serialization framework. The ML[2]
> also has some discussions about this. It's a useful feature.
> This issue maybe needs some designs, or a FLIP.
> [1] [https://developers.google.com/protocol-buffers]
> [2] [http://apache-flink.147419.n8.nabble.com/Flink-SQL-UDF-td3725.html]
--
This message was sent by Atlassian Jira
(v8.3.4#803005)