tol created BEAM-14284:
--------------------------

             Summary: Server-side Dataflow job idempotence
                 Key: BEAM-14284
                 URL: https://issues.apache.org/jira/browse/BEAM-14284
             Project: Beam
          Issue Type: Improvement
          Components: runner-dataflow
            Reporter: tol


*Issue*: When a job submission is retried, it may create duplicate Dataflow 
jobs. The Dataflow job {{name}} is only guaranteed to be unique among _active_ 
jobs; if a job with the same name exists but has already completed, the same 
{{name}} is accepted again. What we would like is job uniqueness regardless of 
job status.

The Dataflow API provides a way to ensure unique jobs through the use of 
{{clientRequestId}}:
{quote}
The client's unique identifier of the job, re-used 
across retried attempts. If this field is set, the service will ensure 
its uniqueness. The request to create a job will fail if the service has 
knowledge of a previously submitted job with the same client's ID and 
job name. The caller may use this field to ensure idempotence of job 
creation across retried attempts to create a job. By default, the field 
is empty and, in that case, the service ignores it.
{quote}
[https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs]

In {{DataflowRunner.java}}, {{clientRequestId}} is currently set to [a randomized 
value|https://github.com/apache/beam/blob/v2.37.0/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1125], 
so a retried submission carries a new ID and the service cannot recognize it 
as a duplicate.

*Proposed solution*: Provide the ability to pass in a {{clientRequestId}} 
through {{DataflowPipelineOptions}} and set it on the {{Job}} when available; 
otherwise, default to the randomized value.
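
A minimal sketch of the fallback logic described above, kept free of Beam dependencies so it runs on its own. The class {{ClientRequestIds}} and its {{resolve}} method are hypothetical names used only for illustration. The sketch assumes the new option would surface as a nullable string and that an empty value should be treated as unset; neither is settled by this issue.

```java
import java.util.UUID;

/** Hypothetical helper for illustration only; not part of Beam. */
public class ClientRequestIds {

  /**
   * Returns the caller-supplied ID when one is present. Otherwise falls back
   * to a random UUID, mirroring the current randomized default.
   * Assumption: null or empty means "not configured".
   */
  public static String resolve(String configuredId) {
    if (configuredId != null && !configuredId.isEmpty()) {
      return configuredId;
    }
    return UUID.randomUUID().toString();
  }

  public static void main(String[] args) {
    // A retry that passes the same configured ID resolves to the same value,
    // which is what lets the service detect the duplicate submission.
    String first = resolve("nightly-export-2022-04-08");
    String retry = resolve("nightly-export-2022-04-08");
    System.out.println("configured IDs match: " + first.equals(retry));

    // Without a configured ID, every call produces a fresh random value.
    System.out.println("fallback ID: " + resolve(null));
  }
}
```

In the runner itself, the resolved value would presumably be passed to {{Job.setClientRequestId}} in place of the unconditional random UUID. The caller remains responsible for choosing an ID that stays stable across retries of the same logical submission.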



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
