[ 
https://issues.apache.org/jira/browse/FLINK-32650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17748873#comment-17748873
 ] 

Suhan Mao commented on FLINK-32650:
-----------------------------------

[~lijingwei.5018] Thanks for opening this ticket and sharing the solution. 

According to benchmark tests from production usage, the single-method code 
generated by protoc is probably larger than the code generated by 
flink-protobuf. If the number of fields is too large, it still has an impact 
on performance. 

I see some related protobuf issues: 
[https://github.com/protocolbuffers/protobuf/issues/10247] and 
[https://github.com/protocolbuffers/protobuf/pull/10367]. It seems someone 
already knew about this problem and tried to fix it in protobuf, but there has 
been no progress recently. If we can push protobuf to make a similar change, 
the performance issue can be nicely solved for this case. Of course, this 
should not block the current ticket's work.

> Added the ability to split flink-protobuf codegen code
> ------------------------------------------------------
>
>                 Key: FLINK-32650
>                 URL: https://issues.apache.org/jira/browse/FLINK-32650
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.17.1
>            Reporter: 李精卫
>            Assignee: 李精卫
>            Priority: Major
>
> h3. background
> Flink serializes and deserializes protobuf-format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java, generated by codegen, to 
> parse byte[] data into protobuf Java objects. The size of the generated 
> decode/encode method body is strongly related to the number of fields defined 
> in the protobuf message. When the number of fields exceeds a certain 
> threshold and the compiled method body exceeds 8k, the decode/encode method 
> will not be optimized by the JIT, seriously hurting serialization and 
> deserialization performance. And if the compiled method body exceeds 64k, the 
> task will fail to start entirely.
> h3. solution
> So I propose a codegen splitter for protobuf parsing that splits the 
> encode/decode method to solve this problem.
> The idea is as follows. Currently, the parsing code for every field defined 
> in the protobuf message is placed in a single decode/encode method body. In 
> fact, there are no shared parameters between the fields, so multiple fields 
> can be grouped together and parsed in a separate split method. Whenever the 
> amount of generated code in the current method body exceeds the threshold, a 
> split method is generated, those fields are parsed inside it, and the split 
> method is called from the decode/encode method. Repeating this process yields 
> the final decode/encode method together with its split methods.
> Code example after splitting:
>  
> {code:java}
> public static RowData decode(
>         org.apache.flink.formats.protobuf.testproto.AdProfile.AdProfilePb message) {
>     RowData rowData = null;
>     org.apache.flink.formats.protobuf.testproto.AdProfile.AdProfilePb message1242 = message;
>     GenericRowData rowData1242 = new GenericRowData(5);
>     split2585(rowData1242, message1242);
>     rowData = rowData1242;
>     return rowData;
> }
> {code}
>  
>  
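To make the splitting idea above concrete, here is a minimal, self-contained sketch of the generated shape (plain Java; Object[] stands in for Flink's GenericRowData, and the method names split0/split1, the field set, and the grouping are hypothetical illustrations, not output of the actual generator):

{code:java}
// Sketch only: the top-level decode() allocates the row and delegates field
// assignment to split methods, so no single compiled method body grows past
// the JIT's ~8k compilation limit. Object[] stands in for GenericRowData.
public class SplitDecodeSketch {

    // Top-level method stays tiny no matter how many fields there are.
    public static Object[] decode(long id, String name, double score, boolean active) {
        Object[] row = new Object[4];
        split0(row, id, name);      // fields 0..1
        split1(row, score, active); // fields 2..3
        return row;
    }

    // Each split method handles a bounded group of fields; the fields share
    // no state, so they can be assigned independently of one another.
    private static void split0(Object[] row, long id, String name) {
        row[0] = id;
        row[1] = name;
    }

    private static void split1(Object[] row, double score, boolean active) {
        row[2] = score;
        row[3] = active;
    }

    public static void main(String[] args) {
        Object[] row = decode(42L, "ad", 0.5, true);
        System.out.println(row[0] + "," + row[1] + "," + row[2] + "," + row[3]);
    }
}
{code}

The key property is that adding more fields adds more split methods rather than growing any one method body, which keeps every compiled method eligible for JIT optimization.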



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
