damccorm opened a new issue, #20056: URL: https://github.com/apache/beam/issues/20056
GCP documentation heavily [promotes](https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub) Beam's PubsubIO for Pub/Sub message deduplication. Yet nothing in the documentation, including the [source code](https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java), tells users how long this deduplication is supposed to last.

In [`PubsubIO.java`](https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853):

```
/**
 * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
 * message attributes, specifies the name of the attribute containing the unique identifier. The
 * value of the attribute can be any string that uniquely identifies this record.
 *
 * <p>Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
 * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
 * delivered, and deduplication of the stream will be strictly best effort.
 */
public Read<T> withIdAttribute(String idAttribute) {
  return toBuilder().setIdAttribute(idAttribute).build();
}
```

This Javadoc does not tell users whether a second message, published with the same custom `idAttribute` value as a first message that was published `x` minutes earlier, would be deduplicated by the Dataflow runner. Better documentation would help.

I imagine a lot of users will wonder about this and may even ask how to configure this period, but that will probably need a separate ticket.

Imported from Jira [BEAM-9354](https://issues.apache.org/jira/browse/BEAM-9354). Original Jira may contain additional context. Reported by: tianzi.
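To make the question concrete, here is a minimal sketch of what a time-bounded deduplicator keyed on an ID attribute could look like. It is purely illustrative: the class `WindowedIdDeduplicator`, its `accept` method, and the `retention` parameter are hypothetical names, and nothing here reflects how Beam or the Dataflow runner actually implements deduplication. It also assumes messages arrive in publish-time order.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical illustration only: drops a message when another message with the same ID
 * was first seen less than {@code retention} earlier. The retention window is an assumed
 * parameter; the real window used by the Dataflow runner is exactly what this issue asks about.
 */
final class WindowedIdDeduplicator {
  private final Duration retention;
  // Publish time of the earliest retained message for each ID.
  private final Map<String, Instant> firstSeen = new HashMap<>();

  WindowedIdDeduplicator(Duration retention) {
    this.retention = retention;
  }

  /** Returns true if the message should be emitted, false if it is dropped as a duplicate. */
  boolean accept(String id, Instant publishTime) {
    Instant previous = firstSeen.get(id);
    if (previous != null
        && Duration.between(previous, publishTime).compareTo(retention) < 0) {
      return false; // same ID seen inside the retention window
    }
    // Either a new ID or the earlier entry has aged out: remember this message.
    firstSeen.put(id, publishTime);
    return true;
  }
}
```

With an assumed retention of 10 minutes, a second message carrying the same ID 5 minutes after the first would be dropped, while one arriving 15 minutes after the first would be emitted again. Without a documented window, users cannot tell which of these cases applies to their pipeline.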
