[jira] [Commented] (FLINK-33611) Support Large Protobuf Schemas

Benchao Li (Jira) Sun, 07 Jan 2024 18:09:04 -0800


    [ 
https://issues.apache.org/jira/browse/FLINK-33611?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17804080#comment-17804080
 ]


Benchao Li commented on FLINK-33611:
------------------------------------

[~dsaisharath] Thanks for the analysis and the effort to improve the Flink 
Protobuf format.

bq. Apart from that, making the code change to reduce too many split methods 
has the most impact in supporting large schemas as I found that method names 
are always included in the constant pool even when the code size is too large 
from my experiment. In fact, this is the main reason which causes compilation 
errors with "too many constants error"

I'm wondering whether there is a real case that runs into this. If so, I think 
it's still worth improving, if there is a way.
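
One way to check whether a real schema gets close to the limit is to read 
constant_pool_count from the compiled class. Below is a minimal sketch, 
assuming the bytes of the generated class are at hand; ConstantPoolCount and 
read are hypothetical names, not Flink APIs. In the class file format the 
count is an unsigned 16-bit field that follows the magic number and the 
minor/major version, so it can never exceed 65535.

```java
import java.nio.ByteBuffer;

// Hypothetical helper for illustration only; not part of Flink.
class ConstantPoolCount {

    // constant_pool_count is a u2 field, so this is its largest possible value.
    static final int MAX_COUNT = 0xFFFF;

    /** Returns constant_pool_count from the header of a compiled class. */
    static int read(byte[] classBytes) {
        ByteBuffer buf = ByteBuffer.wrap(classBytes); // big-endian by default
        if (buf.getInt() != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        buf.getShort(); // minor_version
        buf.getShort(); // major_version
        return buf.getShort() & 0xFFFF; // read the u2 as an unsigned value
    }
}
```

Comparing the returned value with MAX_COUNT for the class generated from a 
large schema would show how much headroom is left.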

bq. With that being said, I would still prefer to keep the changes to reuse 
variable names since the change itself is non-intrusive, harmless, and can only 
improve the performance for compilation. Please let me know your thoughts

I'm inclined not to include it for now, since the code does not solve a real 
problem yet, and it might add a small maintenance burden, since other 
contributors need to understand the code and its intention. 

If there is a significant improvement in compilation, let's reconsider 
this. What do you think?


> Support Large Protobuf Schemas
> ------------------------------
>
>                 Key: FLINK-33611
>                 URL: https://issues.apache.org/jira/browse/FLINK-33611
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.18.0
>            Reporter: Sai Sharath Dandi
>            Assignee: Sai Sharath Dandi
>            Priority: Major
>              Labels: pull-request-available
>
> h3. Background
> Flink serializes and deserializes protobuf format data by calling the decode 
> or encode method in GeneratedProtoToRow_XXX.java generated by codegen to 
> parse byte[] data into Protobuf Java objects. FLINK-32650 has introduced the 
> ability to split the generated code to improve the performance for large 
> Protobuf schemas. However, this is still not sufficient to support some 
> larger Protobuf schemas, as the generated code exceeds the Java constant pool 
> size [limit|https://en.wikipedia.org/wiki/Java_class_file#The_constant_pool] 
> and we can see errors like "Too many constants" when trying to compile the 
> generated code. 
> *Solution*
> Since the split code functionality has already been introduced, the 
> main proposal here is to reuse variable names across different split 
> method scopes. This will greatly reduce the constant pool size. One more 
> optimization is to split the last code segment only when its size 
> exceeds the split threshold. Currently, the last segment of the generated 
> code is always split, which can lead to too many split methods and thus 
> exceed the constant pool size limit.
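
The two changes proposed in the description can be sketched roughly as 
follows. This is not the actual Flink codegen: SplitSketch, split, 
distinctNames and the character-count threshold are hypothetical and chosen 
only for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch for illustration only; not the Flink Protobuf codegen.
class SplitSketch {

    /** Split method bodies plus the trailing code that stays inline. */
    static final class Result {
        final List<String> splitMethods;
        final String inlineTail;

        Result(List<String> splitMethods, String inlineTail) {
            this.splitMethods = splitMethods;
            this.inlineTail = inlineTail;
        }
    }

    /**
     * Accumulates segments and closes a split method only once the
     * accumulated code reaches the threshold. A short remainder is kept
     * inline instead of always becoming one more split method.
     */
    static Result split(List<String> segments, int threshold) {
        List<String> methods = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String segment : segments) {
            current.append(segment);
            if (current.length() >= threshold) {
                methods.add(current.toString());
                current.setLength(0);
            }
        }
        return new Result(methods, current.toString());
    }

    /**
     * Counts the distinct variable names generated for a number of split
     * methods, with a counter that restarts per method (reuse) or a single
     * global counter (no reuse).
     */
    static int distinctNames(int methods, int varsPerMethod, boolean reusePerMethod) {
        Set<String> names = new HashSet<>();
        int global = 0;
        for (int m = 0; m < methods; m++) {
            int local = 0; // restarts for every split method
            for (int v = 0; v < varsPerMethod; v++) {
                names.add("v" + (reusePerMethod ? local++ : global++));
            }
        }
        return names.size();
    }
}
```

With per-method counters the number of distinct names is bounded by the 
largest method rather than growing with the number of split methods, which 
is the effect the proposal relies on. How much this helps in practice 
depends on whether those names actually end up in the constant pool of the 
compiled class.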



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
