luchunliang opened a new issue, #12141:
URL: https://github.com/apache/inlong/issues/12141

   ### What happened
   
   When using transform-sdk to decode Protobuf messages and write to typed 
sinks (Iceberg/Parquet via RowData), unset fields are incorrectly output as 
protobuf default values (0, "", false, empty list) instead of null. This causes 
downstream sinks to write wrong data — for example, an unset int64 field 
appears as 0 in Iceberg rather than NULL.
   
   Additionally, when upstream protobuf messages have missing required fields 
(common in production with schema evolution), the decoder throws 
UninitializedMessageException and drops the entire record.
   
   Root Cause
   
   In protobuf-java, DynamicMessage.getField(fieldDesc) never returns null for 
unset fields — it returns the type's default value. The code must call 
hasField() before getField() to distinguish "field is set to 0" from "field is 
not set". Multiple locations in PbSourceData and PbSourceDecoder are missing 
this check.
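
To make the root cause concrete, here is a minimal plain-Java sketch of the guarded read. `ToyMessage` is a hypothetical stand-in, not the protobuf-java API: it only mimics the behaviour described above, where `getField()` returns a default for unset fields and `hasField()` reports presence. The field names `count` and `total` are likewise made up for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a dynamic message; NOT the protobuf-java API. */
final class ToyMessage {
    private final Map<String, Object> defaults = new HashMap<>();
    private final Map<String, Object> values = new HashMap<>();

    /** Declares a field together with the default returned while it is unset. */
    ToyMessage declare(String field, Object defaultValue) {
        defaults.put(field, defaultValue);
        return this;
    }

    /** Explicitly sets a declared field. */
    ToyMessage setField(String field, Object value) {
        if (!defaults.containsKey(field)) {
            throw new IllegalArgumentException("undeclared field: " + field);
        }
        values.put(field, value);
        return this;
    }

    /** True only when the field was explicitly set. */
    boolean hasField(String field) {
        return values.containsKey(field);
    }

    /** Mimics the described behaviour: never null, falls back to the default. */
    Object getField(String field) {
        return values.containsKey(field) ? values.get(field) : defaults.get(field);
    }
}

final class PresenceDemo {
    /** Unguarded read: an unset field and an explicit zero look identical. */
    static Object readUnguarded(ToyMessage msg, String field) {
        return msg.getField(field);
    }

    /** Guarded read: presence is checked first, so an unset field becomes null. */
    static Object readOrNull(ToyMessage msg, String field) {
        return msg.hasField(field) ? msg.getField(field) : null;
    }

    public static void main(String[] args) {
        ToyMessage msg = new ToyMessage()
                .declare("count", 0L)   // left unset
                .declare("total", 0L)
                .setField("total", 0L); // explicitly set to zero

        System.out.println(readUnguarded(msg, "count")); // prints 0
        System.out.println(readOrNull(msg, "count"));    // prints null
        System.out.println(readOrNull(msg, "total"));    // prints 0
    }
}
```

With the guard in place, only the explicitly set `total` yields 0; the unset `count` yields null, which a typed sink can then write as NULL.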
   
   Impact
   
   - Data correctness: Unset numeric fields (int/long/float/double) are written as 0 instead of NULL in Iceberg, indistinguishable from a legitimately set zero value.
   - Data loss: Messages with missing proto2 required fields cause UninitializedMessageException, and the entire record is silently dropped.
   - Null semantics broken: transformForBytes() converts null field values to the empty string "", preventing downstream RowData encoders from emitting proper NULL values.
   
   Fix
   
   - PbSourceDecoder.decode(): Use buildPartial() instead of build() to tolerate missing required fields.
   - PbSourceData.buildStructData(): Add a hasField() check before getField() for non-repeated fields.
   - PbSourceData.findNodeValue(): Add a hasField() check before getField() for non-repeated fields.
   - PbSourceData.buildMapData() / parseMapNode(): Add a hasField() check for the map entry key/value.
   - TransformProcessor.transformForBytes(): Pass null instead of "" when the field value is null, preserving null semantics for binary sinks.
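
The null-handling change for transformForBytes() can be sketched in plain Java as follows. This is an illustration under assumptions, not the actual TransformProcessor code: `coerceNullToEmpty`, `preserveNull` and the toy `encodeCell` encoder are hypothetical names used only to show why replacing null with "" stops a typed encoder from emitting NULL.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class NullSemanticsDemo {
    /** Behaviour described in the issue: a null field value is replaced by "". */
    static List<Object> coerceNullToEmpty(List<Object> fields) {
        List<Object> out = new ArrayList<>();
        for (Object field : fields) {
            out.add(field == null ? "" : field);
        }
        return out;
    }

    /** Proposed behaviour: values are copied unchanged, so null survives. */
    static List<Object> preserveNull(List<Object> fields) {
        return new ArrayList<>(fields);
    }

    /** Toy typed encoder (hypothetical): only a real null is rendered as NULL. */
    static String encodeCell(Object value) {
        return value == null ? "NULL" : "'" + value + "'";
    }

    public static void main(String[] args) {
        // Second field was unset upstream and decoded as null.
        List<Object> decoded = Arrays.asList(42L, null);

        System.out.println(encodeCell(coerceNullToEmpty(decoded).get(1))); // prints ''
        System.out.println(encodeCell(preserveNull(decoded).get(1)));      // prints NULL
    }
}
```

Once the empty string has been substituted, the encoder has no way to tell an unset field from a field that was legitimately set to "", which is why the fix keeps the null.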
   
   Affected Versions
   
   inlong-sdk/transform-sdk (all versions up to current master)
   
   ### What you expected to happen
   
   Unset Protobuf fields should be emitted as null, so that typed sinks (Iceberg/Parquet via RowData) write NULL rather than a protobuf default value such as 0, "" or false.
   
   Messages with missing proto2 required fields should still be decoded, instead of failing with UninitializedMessageException and dropping the entire record.
   
   ### How to reproduce
   
   - Use transform-sdk to decode a Protobuf message in which a field (for example, an int64) is left unset, and write the result to a typed sink (Iceberg/Parquet via RowData). The field appears as 0 rather than NULL.
   - Decode a Protobuf message that is missing a proto2 required field. The decoder throws UninitializedMessageException and the entire record is dropped.
   
   ### Environment
   
   _No response_
   
   ### InLong version
   
   master
   
   ### InLong Component
   
   InLong SDK
   
   ### Are you willing to submit PR?
   
   - [x] Yes, I am willing to submit a PR!
   
   ### Code of Conduct
   
   - [x] I agree to follow this project's [Code of Conduct](https://www.apache.org/foundation/policies/conduct)
   

