[jira] [Commented] (FLINK-18202) Introduce Protobuf format

Suhan Mao (Jira) Tue, 17 Nov 2020 03:29:07 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-18202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17233527#comment-17233527
 ]


Suhan Mao commented on FLINK-18202:
-----------------------------------

I have some experience and already implemented the code of converting between 
protobuf and row. Both of below two features are already online and serve 
millions of QPS now. We have implemented using proto2 but it is not a difficult 
thing to migrate to proto3.
 # protobuf -> row。According to benchao's benchmark result, java native API is 
over 4 times faster than dynamic message. It is also the case we met in our 
real environment, so we use janino codegen way to convert pb to flink row 
according to the pb Descriptor and required schema. We have already finished 
the implementation and it is very close to the native API performance.We 
support all kinds of complex types in pb including nested message or map.
 # row -> protobuf. In our test, the java native API to serialize is still not 
fast enough. Because in the whole process, we need to access all columns 3 
times. (1) getter setter method of builder, (2) build() to copy the fields (3) 
toBinaryArray() to read all fields again. So we use a more efficient way to 
serialize the row. We just read each field in a row once and then directly 
flush with com.google.protobuf.CodedOutputStream. It is 3 times faster than 
java native API. The implementation use some native and basic API of protobuf.

Below is my benchmark result:

*10M rows to convert from pb to row*

Json : 105s

DynamicMessage:  166s

Code gen: 40s

 

*10M rows to convert from row to pb*

json: 106s 

DynamicMessage: 69s

CodedOutputStream: 23s

> Introduce Protobuf format
> -------------------------
>
>                 Key: FLINK-18202
>                 URL: https://issues.apache.org/jira/browse/FLINK-18202
>             Project: Flink
>          Issue Type: New Feature
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile), Table 
> SQL / API
>            Reporter: Benchao Li
>            Priority: Major
>         Attachments: image-2020-06-15-17-18-03-182.png
>
>
> PB[1] is a very famous and wildly used (de)serialization framework. The ML[2] 
> also has some discussions about this. It's a useful feature.
> This issue maybe needs some designs, or a FLIP.
> [1] [https://developers.google.com/protocol-buffers]
> [2] [http://apache-flink.147419.n8.nabble.com/Flink-SQL-UDF-td3725.html]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

[jira] [Commented] (FLINK-18202) Introduce Protobuf format

Reply via email to