egalpin commented on issue #23525: URL: https://github.com/apache/beam/issues/23525#issuecomment-1281294275
@apilloud It's introduced by https://github.com/apache/beam/pull/22216 rather than fixed. Part of the complexity, at least for me, is the multiple coders registered against the same class in [PubsubCoderProviderRegistrar](https://github.com/apache/beam/blob/693725d6017d2ef6572a12c8688caef5a601fa97/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubCoderProviderRegistrar.java). Since all pubsub messages of type `org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage` appear to use the first coder in the registrar by default, any message carrying a superset of the properties supported by [PubsubMessageWithAttributesCoder](https://github.com/apache/beam/blob/693725d6017d2ef6572a12c8688caef5a601fa97/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubMessageWithAttributesCoder.java) will have the extra properties silently nullified.

Assuming the "superset" coder that supports all the fields were the default, what would the drawback be in the case where, say, the messageId is never used? It would mean more bytes to represent each PubsubMessage, but are there other drawbacks? Otherwise it seems like it might be easiest to make the "superset" coder the default and treat the other coders as data-size optimizations that users can opt in to.

I do like the idea of raising an error in the coders when unsupported fields are non-null; I think this could be an effective way to defend against data loss, and one that could be implemented relatively quickly.
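To illustrate the fail-fast idea, here is a minimal sketch of a coder that rejects messages with fields it cannot encode, rather than dropping them silently. The classes below (`SketchPubsubMessage`, `StrictAttributesCoder`) are hypothetical simplified stand-ins, not the actual Beam API; a real implementation would live inside the existing coder's `encode` method.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;

// Hypothetical stand-in for org.apache.beam.sdk.io.gcp.pubsub.PubsubMessage.
final class SketchPubsubMessage {
    final byte[] payload;
    final Map<String, String> attributes; // may be null
    final String messageId;               // may be null

    SketchPubsubMessage(byte[] payload, Map<String, String> attributes, String messageId) {
        this.payload = payload;
        this.attributes = attributes;
        this.messageId = messageId;
    }
}

// A coder that only supports payload + attributes. Instead of silently
// nullifying messageId on round-trip, it fails fast when the field is set.
final class StrictAttributesCoder {
    void encode(SketchPubsubMessage msg, OutputStream out) throws IOException {
        if (msg.messageId != null) {
            throw new IllegalStateException(
                "messageId is set but this coder does not encode it; "
                + "use the full-message coder to avoid silent data loss");
        }
        out.write(msg.payload);
        // (attribute encoding elided in this sketch)
    }
}

public class Main {
    public static void main(String[] args) throws IOException {
        StrictAttributesCoder coder = new StrictAttributesCoder();

        // A message within the coder's supported fields encodes normally.
        coder.encode(
            new SketchPubsubMessage("hello".getBytes(StandardCharsets.UTF_8), null, null),
            new ByteArrayOutputStream());

        // A message with an unsupported field set is rejected loudly.
        try {
            coder.encode(
                new SketchPubsubMessage("hello".getBytes(StandardCharsets.UTF_8), null, "id-123"),
                new ByteArrayOutputStream());
            System.out.println("no error");
        } catch (IllegalStateException e) {
            System.out.println("rejected: unsupported field set");
        }
    }
}
```

The trade-off is that pipelines currently (and perhaps unknowingly) relying on the lossy behavior would start failing at encode time, which is arguably the point: a loud failure at the coder is recoverable, while silent nullification is not.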
