[
https://issues.apache.org/jira/browse/BEAM-9354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17121675#comment-17121675
]
Kenneth Knowles commented on BEAM-9354:
---------------------------------------
This issue is assigned but has not received an update in 30 days so it has been
labeled "stale-assigned". If you are still working on the issue, please give an
update and remove the label. If you are no longer working on the issue, please
unassign so someone else may work on it. In 7 days the issue will be
automatically unassigned.
> How long does PubSubIO message deduplication last?
> --------------------------------------------------
>
> Key: BEAM-9354
> URL: https://issues.apache.org/jira/browse/BEAM-9354
> Project: Beam
> Issue Type: Improvement
> Components: io-java-gcp
> Reporter: Tianzi Cai
> Assignee: Reuven Lax
> Priority: P2
> Labels: gcp, pubsubio, stale-assigned
>
> GCP documentation heavily
> [promotes|https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub]
> Beam's PubsubIO for Pub/Sub message deduplication. Yet nothing in the
> documentation, including the [source
> code|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java],
> tells users how long this deduplication is supposed to last.
> In
> [`PubsubIO.java`|https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853]:
> {code:java}
>   /**
>    * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
>    * message attributes, specifies the name of the attribute containing the unique identifier. The
>    * value of the attribute can be any string that uniquely identifies this record.
>    *
>    * <p>Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
>    * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
>    * delivered, and deduplication of the stream will be strictly best effort.
>    */
>   public Read<T> withIdAttribute(String idAttribute) {
>     return toBuilder().setIdAttribute(idAttribute).build();
>   }
> {code}
> This Javadoc isn't enough for users to know whether a second message,
> published with the same custom idAttribute value as a first message that was
> published `x` minutes earlier, would be deduplicated by the Dataflow runner.
> Better documentation would help. I imagine many users will wonder about
> this and may even ask how to configure the period, but that will probably
> need a separate ticket.
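
To make the question concrete, here is a minimal, self-contained sketch of time-bounded deduplication by ID. It is purely illustrative: `TimedDeduplicator`, its `retention` parameter, and the expiry rule are hypothetical names and assumptions, not Beam or Dataflow code, and nothing here is meant to describe how the Dataflow runner actually deduplicates. It only shows why the length of the window decides whether a second message with the same ID is dropped.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration only; not Beam or Dataflow code. */
final class TimedDeduplicator {
  // Assumed retention window: how long an ID is remembered after first being seen.
  private final Duration retention;
  // Maps each tracked ID to the time it was first seen.
  private final Map<String, Instant> firstSeen = new HashMap<>();

  TimedDeduplicator(Duration retention) {
    this.retention = retention;
  }

  /** Returns true if the message should be emitted, false if it is dropped as a duplicate. */
  boolean accept(String id, Instant now) {
    // Forget IDs whose retention window has fully elapsed.
    firstSeen.values().removeIf(seen -> !seen.plus(retention).isAfter(now));
    // putIfAbsent returns null only when the ID was not already tracked.
    return firstSeen.putIfAbsent(id, now) == null;
  }
}
```

With an assumed 10-minute window, a repeat of the same ID 5 minutes after the first message would be dropped, while a repeat 11 minutes later would pass through as if it were new. Which of these two outcomes users should expect for a given `x` is exactly what the current documentation leaves unstated.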