[ 
https://issues.apache.org/jira/browse/BEAM-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17548424#comment-17548424
 ] 

Danny McCormick commented on BEAM-9354:
---------------------------------------

This issue has been migrated to https://github.com/apache/beam/issues/20056

> How long does PubSubIO message deduplication last?
> --------------------------------------------------
>
>                 Key: BEAM-9354
>                 URL: https://issues.apache.org/jira/browse/BEAM-9354
>             Project: Beam
>          Issue Type: Improvement
>          Components: io-java-gcp
>            Reporter: Tianzi Cai
>            Priority: P3
>              Labels: gcp, pubsubio
>
> GCP documentation heavily 
> [promotes|https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub]
>  Beam's PubSubIO for Pub/Sub message deduplication. Yet nothing in the 
> documentation, including the [source 
> code|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java],
>  tells users how long this deduplication is supposed to last. 
> In 
> [`PubsubIO.java`|https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853]:
> {code:java}
>     /**
>      * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
>      * message attributes, specifies the name of the attribute containing the unique identifier. The
>      * value of the attribute can be any string that uniquely identifies this record.
>      *
>      * <p>Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
>      * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
>      * delivered, and deduplication of the stream will be strictly best effort.
>      */
>     public Read<T> withIdAttribute(String idAttribute) {
>       return toBuilder().setIdAttribute(idAttribute).build();
>     }
> {code}
> This is not enough for users to know whether a second message would be 
> deduplicated by the Dataflow runner when it carries the same custom 
> `idAttribute` value as a first message published `x` minutes earlier. 
> Better documentation would help. I imagine many users will wonder about 
> this and may even ask how to configure this period, but that probably 
> needs a separate ticket.
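
To make the question concrete, here is a minimal sketch of time-bounded deduplication keyed on an id attribute. It is purely illustrative and is not how Beam or the Dataflow runner implements this: the class name `WindowedDeduplicator`, the `retention` parameter, and the 10-minute value in the usage are all hypothetical, and the sketch assumes publish times arrive in non-decreasing order. It only shows why the answer for a message published `x` minutes later depends on a retention period that the documentation does not state.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; NOT Beam's or Dataflow's implementation. */
class WindowedDeduplicator {
  // Assumed, made-up retention period: how long an id is remembered.
  private final Duration retention;
  // Maps each remembered id to the publish time at which it was first seen.
  private final Map<String, Instant> firstSeen = new HashMap<>();

  WindowedDeduplicator(Duration retention) {
    this.retention = retention;
  }

  /** Returns true if the message should be emitted, false if it is dropped as a duplicate. */
  boolean accept(String id, Instant publishTime) {
    // Forget ids first seen more than `retention` before this message.
    Instant cutoff = publishTime.minus(retention);
    firstSeen.values().removeIf(seen -> seen.isBefore(cutoff));
    // putIfAbsent returns null only when the id was not already remembered.
    return firstSeen.putIfAbsent(id, publishTime) == null;
  }

  public static void main(String[] args) {
    WindowedDeduplicator dedup = new WindowedDeduplicator(Duration.ofMinutes(10));
    Instant t0 = Instant.parse("2020-02-20T00:00:00Z");
    System.out.println(dedup.accept("order-1", t0));
    System.out.println(dedup.accept("order-1", t0.plus(Duration.ofMinutes(5))));
    System.out.println(dedup.accept("order-1", t0.plus(Duration.ofMinutes(11))));
  }
}
```

With this hypothetical 10-minute retention, a repeat of the same id 5 minutes later is dropped, while a repeat 11 minutes later is emitted again. Without a documented period, users cannot tell which of these two cases applies to their pipeline.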



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
