Tianzi Cai created BEAM-9354:
--------------------------------
Summary: How long does PubSubIO message deduplication last?
Key: BEAM-9354
URL: https://issues.apache.org/jira/browse/BEAM-9354
Project: Beam
Issue Type: Improvement
Components: io-java-gcp
Reporter: Tianzi Cai
GCP documentation heavily
[promotes|https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub]
Beam's PubsubIO for Pub/Sub message deduplication. Yet nothing in the
documentation or in the [source
code|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java]
tells users how long this deduplication is supposed to last.
In
[`PubsubIO.java`|https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853]:
{code:java}
/**
 * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
 * message attributes, specifies the name of the attribute containing the unique identifier. The
 * value of the attribute can be any string that uniquely identifies this record.
 *
 * <p>Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
 * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
 * delivered, and deduplication of the stream will be strictly best effort.
 */
public Read<T> withIdAttribute(String idAttribute) {
  return toBuilder().setIdAttribute(idAttribute).build();
}
{code}
This Javadoc does not tell users whether a second message, published with the
same custom ID attribute `x` minutes after the first, would be deduplicated by
the Dataflow runner. Better documentation would help. I imagine many users will
wonder about this and may even ask how to configure the period, but that
probably needs a separate ticket.
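To make the question concrete, here is a minimal sketch of time-bounded deduplication. It is an illustration only, not how PubsubIO or the Dataflow runner implements it: the class `DedupWindowSketch`, its `accept` method, and the 10-minute retention value are all hypothetical. It only shows why the length of the period decides whether the second message is dropped.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/** Illustration only: NOT the PubsubIO or Dataflow implementation. */
public class DedupWindowSketch {
    // Hypothetical retention period; this is the value the ticket asks to have documented.
    private final Duration retention;
    // Publish time of the last emitted message for each ID attribute value.
    private final Map<String, Instant> lastEmitted = new HashMap<>();

    public DedupWindowSketch(Duration retention) {
        this.retention = retention;
    }

    /** Returns true if the message is emitted, false if it is dropped as a duplicate. */
    public boolean accept(String id, Instant publishTime) {
        Instant previous = lastEmitted.get(id);
        if (previous != null
                && Duration.between(previous, publishTime).compareTo(retention) < 0) {
            return false; // same ID seen within the retention period
        }
        lastEmitted.put(id, publishTime);
        return true;
    }

    public static void main(String[] args) {
        DedupWindowSketch dedup = new DedupWindowSketch(Duration.ofMinutes(10));
        Instant t0 = Instant.parse("2020-02-19T00:00:00Z");
        System.out.println(dedup.accept("order-42", t0));                              // true
        System.out.println(dedup.accept("order-42", t0.plus(Duration.ofMinutes(5))));  // false
        System.out.println(dedup.accept("order-42", t0.plus(Duration.ofMinutes(15)))); // true
    }
}
```

With the assumed 10-minute period, the duplicate published 5 minutes later is dropped, while the one published 15 minutes later passes through. Without a documented period, users cannot tell which of the two cases applies to their pipeline.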