Tianzi Cai created BEAM-9354:
--------------------------------

             Summary: How long does PubSubIO message deduplication last?
                 Key: BEAM-9354
                 URL: https://issues.apache.org/jira/browse/BEAM-9354
             Project: Beam
          Issue Type: Improvement
          Components: io-java-gcp
            Reporter: Tianzi Cai


GCP documentation heavily 
[promotes|https://cloud.google.com/dataflow/docs/concepts/streaming-with-cloud-pubsub]
 Beam's PubsubIO for Pub/Sub message deduplication. Yet nothing in the 
documentation, including the [source 
code|https://github.com/apache/beam/blob/master/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java],
 tells users how long this deduplication is supposed to last. 

In 
[`PubsubIO.java`|https://github.com/apache/beam/blob/a24bc3bae54f089b93bd66a118bd4bf09dbc9254/sdks/java/io/google-cloud-platform/src/main/java/org/apache/beam/sdk/io/gcp/pubsub/PubsubIO.java#L842-L853]:
{code:java}
    /**
     * When reading from Cloud Pub/Sub where unique record identifiers are provided as Pub/Sub
     * message attributes, specifies the name of the attribute containing the unique identifier. The
     * value of the attribute can be any string that uniquely identifies this record.
     *
     * <p>Pub/Sub cannot guarantee that no duplicate data will be delivered on the Pub/Sub stream.
     * If {@code idAttribute} is not provided, Beam cannot guarantee that no duplicate data will be
     * delivered, and deduplication of the stream will be strictly best effort.
     */
    public Read<T> withIdAttribute(String idAttribute) {
      return toBuilder().setIdAttribute(idAttribute).build();
    }
{code}
This comment does not tell users whether a second message, published with the 
same custom {{idAttribute}} value as a first message that was published `x` 
minutes earlier, would be deduplicated by the Dataflow runner. 
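
To make the question concrete, here is a minimal sketch of what a time-bounded 
deduplication could look like. It is a hypothetical model only: the class name 
{{DedupWindowSketch}}, the {{accept}} method, and the retention parameter are 
all assumptions for illustration, and nothing here reflects how Beam or the 
Dataflow runner actually implements deduplication or which period it uses.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical model of time-bounded deduplication keyed on a message ID.
 * NOT the Beam or Dataflow implementation; the retention period is an
 * assumed, caller-supplied parameter.
 */
class DedupWindowSketch {
  private final Duration retention;
  // For each ID, the publish time of the most recently delivered message.
  private final Map<String, Instant> lastDelivered = new HashMap<>();

  DedupWindowSketch(Duration retention) {
    this.retention = retention;
  }

  /** Returns true if the message is delivered, false if dropped as a duplicate. */
  boolean accept(String id, Instant publishTime) {
    Instant previous = lastDelivered.get(id);
    if (previous != null
        && Duration.between(previous, publishTime).compareTo(retention) < 0) {
      return false; // Same ID delivered less than `retention` ago: duplicate.
    }
    lastDelivered.put(id, publishTime); // New ID, or the earlier record expired.
    return true;
  }
}
```

In this model, with an assumed retention of 10 minutes, a second message with 
the same ID published 5 minutes after the first would be dropped, while one 
published 15 minutes after the first would be delivered again. Which of these 
cases applies for a given `x` is exactly what the documentation leaves open.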

Better documentation would help. I imagine many users will wonder about this 
and may even ask how to configure the period, but that probably needs a 
separate ticket.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
