egalpin commented on issue #23525:
URL: https://github.com/apache/beam/issues/23525#issuecomment-1281294275

   @apilloud It's introduced by https://github.com/apache/beam/pull/22216 
rather than fixed.  Part of the complexity, at least for me, is the multiple 
coders registered against the same class in 
[PubsubCoderProviderRegistrar](https://github.com/apache/beam/blob/693725d6017d2ef6572a12c8688caef5a601fa97/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubCoderProviderRegistrar.java).
  Since all pubsub messages of type `org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage` appear to use the first coder in the registrar by default, any message carrying a superset of the properties that [PubsubMessageWithAttributesCoder](https://github.com/apache/beam/blob/693725d6017d2ef6572a12c8688caef5a601fa97/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubMessageWithAttributesCoder.java) supports will have the extra properties silently nullified.
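   To make the failure mode concrete, here is a minimal, self-contained sketch of the round-trip loss. The `Msg` class and `AttributesOnlyCoder` are simplified stand-ins for illustration only, not the real Beam `PubsubMessage` or `PubsubMessageWithAttributesCoder`: the coder writes only the payload and attributes, so any `messageId` present on the input comes back null after decode, with no error raised.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.*;

// Simplified stand-in for PubsubMessage (not the real Beam class).
class Msg {
    final byte[] payload;
    final Map<String, String> attributes;
    final String messageId; // the field the narrower coder ignores

    Msg(byte[] payload, Map<String, String> attributes, String messageId) {
        this.payload = payload;
        this.attributes = attributes;
        this.messageId = messageId;
    }
}

// Analogue of an attributes-only coder: it round-trips payload and
// attributes but never writes messageId, so decode() restores it as null.
class AttributesOnlyCoder {
    byte[] encode(Msg m) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(m.payload.length);
        out.write(m.payload);
        out.writeInt(m.attributes.size());
        for (Map.Entry<String, String> e : m.attributes.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
        // messageId is intentionally not encoded -- this is the silent loss.
        return bos.toByteArray();
    }

    Msg decode(byte[] bytes) throws IOException {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        byte[] payload = new byte[in.readInt()];
        in.readFully(payload);
        int n = in.readInt();
        Map<String, String> attrs = new HashMap<>();
        for (int i = 0; i < n; i++) {
            attrs.put(in.readUTF(), in.readUTF());
        }
        return new Msg(payload, attrs, null); // messageId silently nullified
    }
}

public class SilentLossDemo {
    public static void main(String[] args) throws IOException {
        Msg original = new Msg("hello".getBytes(StandardCharsets.UTF_8),
                Map.of("key", "value"), "id-123");
        AttributesOnlyCoder coder = new AttributesOnlyCoder();
        Msg roundTripped = coder.decode(coder.encode(original));
        System.out.println(roundTripped.messageId); // null: the id is gone
    }
}
```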
   
   Assuming the "superset" coder that supports all the fields were the default, what would the drawback be in a case where, say, the messageId is never used?  It would mean more bytes to represent each PubsubMessage, but are there other drawbacks?  Otherwise it seems easiest to make the "superset" coder the default, and treat the other coders as data-size optimizations that users can opt in to.
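   A rough sketch of that size cost, using simplified stand-in encoders rather than the real Beam coders: the "superset" encoding pays at least a null marker per optional field even when the field is never populated, and the full field cost when it is.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;

// Simplified stand-ins (not the real Beam coders) comparing the encoded
// size of a narrow coder with a "superset" coder that always accounts
// for an optional messageId field.
public class SizeOverheadDemo {
    // Narrow coder: payload only.
    static byte[] encodeNarrow(byte[] payload) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(payload.length);
        out.write(payload);
        return bos.toByteArray();
    }

    // "Superset" coder: payload plus a nullable messageId.
    static byte[] encodeSuperset(byte[] payload, String messageId) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.writeInt(payload.length);
        out.write(payload);
        out.writeBoolean(messageId != null); // null marker costs 1 byte even when unused
        if (messageId != null) {
            out.writeUTF(messageId); // 2-byte length prefix + UTF bytes
        }
        return bos.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println("narrow:               " + encodeNarrow(payload).length + " bytes");
        System.out.println("superset (id unused): " + encodeSuperset(payload, null).length + " bytes");
        System.out.println("superset (id set):    " + encodeSuperset(payload, "id-123").length + " bytes");
    }
}
```

   Under this toy encoding the overhead of defaulting to the superset coder is one byte per absent optional field, which supports viewing the narrower coders as opt-in size optimizations rather than the default.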
   
   I do like the idea of having the coders raise an error when unsupported fields are non-null; that could be an effective defense against data loss, and one that could be implemented relatively quickly.
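   A minimal, self-contained sketch of that fail-fast guard. `Message` and `FailFastCoder` are hypothetical stand-ins, not the real Beam classes: instead of silently dropping a populated field it does not encode, the coder rejects the message at encode time.

```java
import java.io.IOException;

// Simplified stand-in for PubsubMessage (not the real Beam class).
class Message {
    final byte[] payload;
    final String messageId;

    Message(byte[] payload, String messageId) {
        this.payload = payload;
        this.messageId = messageId;
    }
}

// Hypothetical fail-fast guard: encode() refuses a message whose
// unsupported field is populated instead of silently dropping it.
class FailFastCoder {
    byte[] encode(Message m) throws IOException {
        if (m.messageId != null) {
            throw new IOException(
                "messageId is set but this coder does not encode it; "
                + "use a coder that supports all populated fields");
        }
        return m.payload.clone(); // only the payload is encoded here
    }
}

public class FailFastDemo {
    public static void main(String[] args) {
        FailFastCoder coder = new FailFastCoder();
        try {
            coder.encode(new Message(new byte[] {1, 2}, "id-42"));
            System.out.println("encoded");
        } catch (IOException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

   The pipeline fails loudly at the point where data would otherwise be lost, which seems strictly better than discovering nullified fields downstream.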

