[ 
https://issues.apache.org/jira/browse/FLINK-33817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sai Sharath Dandi updated FLINK-33817:
--------------------------------------
    Description: 
*Background*

 

The current Protobuf format 
[implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
 always sets ReadDefaultValues=False when using Proto3 version. This can cause 
severe performance degradation for large Protobuf schemas with OneOf fields as 
the entire generated code needs to be executed during deserialization even when 
certain fields are not present in the data to be deserialized and all the 
subsequent nested Fields can be skipped. Proto3 supports hasXXX() methods for 
checking field presence for non primitive types since Proto version 
[3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In the 
internal performance benchmarks in our company, we've seen almost 10x 
difference in performance for one of our real production usecase when allowing 
to set ReadDefaultValues=False with proto3 version. The exact difference in 
performance depends on the schema complexity and data payload but we should 
allow user to set readDefaultValue=False in general.

 

*Solution*

 

Support using ReadDefaultValues=False when using Proto3 version. We need to be 
careful to check for field presence only on non-primitive types if 
ReadDefaultValues is false and version used is Proto3

  was:
*Background*

 

The current Protobuf format 
[implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
 always sets ReadDefaultValues=False when using Proto3 version. This can cause 
severe performance degradation for large Protobuf schemas with OneOf fields as 
the entire generated code needs to be executed during deserialization even when 
certain fields are not present in the data to be deserialized and all the 
subsequent nested Fields can be skipped. Proto3 supports hasXXX() methods for 
checking field presence for non primitive types since Proto version 
[3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In the 
internal performance benchmarks in our company, we've seen almost 10x 
difference in performance for one of our real production usecase when allowing 
to set ReadDefaultValues=False with proto3 version. The exact difference in 
performance depends on the schema complexity and data payload but we should 
allow readDefaultValue=False in general.

 

*Solution*

 

Support using ReadDefaultValues=False when using Proto3 version. We need to be 
careful to check for field presence only on non-primitive types if 
ReadDefaultValues is false and version used is Proto3


> Allow ReadDefaultValues = False for non primitive types on Proto3
> -----------------------------------------------------------------
>
>                 Key: FLINK-33817
>                 URL: https://issues.apache.org/jira/browse/FLINK-33817
>             Project: Flink
>          Issue Type: Improvement
>          Components: Formats (JSON, Avro, Parquet, ORC, SequenceFile)
>    Affects Versions: 1.18.0
>            Reporter: Sai Sharath Dandi
>            Priority: Major
>
> *Background*
>  
> The current Protobuf format 
> [implementation|https://github.com/apache/flink/blob/c3e2d163a637dca5f49522721109161bd7ebb723/flink-formats/flink-protobuf/src/main/java/org/apache/flink/formats/protobuf/deserialize/ProtoToRowConverter.java]
>  always sets ReadDefaultValues=False when using Proto3 version. This can 
> cause severe performance degradation for large Protobuf schemas with OneOf 
> fields as the entire generated code needs to be executed during 
> deserialization even when certain fields are not present in the data to be 
> deserialized and all the subsequent nested Fields can be skipped. Proto3 
> supports hasXXX() methods for checking field presence for non primitive types 
> since Proto version 
> [3.15|https://github.com/protocolbuffers/protobuf/releases/tag/v3.15.0]. In 
> the internal performance benchmarks in our company, we've seen almost 10x 
> difference in performance for one of our real production usecase when 
> allowing to set ReadDefaultValues=False with proto3 version. The exact 
> difference in performance depends on the schema complexity and data payload 
> but we should allow user to set readDefaultValue=False in general.
>  
> *Solution*
>  
> Support using ReadDefaultValues=False when using Proto3 version. We need to 
> be careful to check for field presence only on non-primitive types if 
> ReadDefaultValues is false and version used is Proto3



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to