tol created BEAM-14284:
--------------------------
Summary: Server-side Dataflow job idempotence
Key: BEAM-14284
URL: https://issues.apache.org/jira/browse/BEAM-14284
Project: Beam
Issue Type: Improvement
Components: runner-dataflow
Reporter: tol
*Issue*: when a job submission is retried, the retry may create a duplicate
Dataflow job. The Dataflow job {{name}} only guarantees uniqueness among
_active_ jobs: if a job with the same name exists but has already completed,
the same {{name}} is accepted again. We would like job uniqueness regardless of
job status.
The Dataflow API provides a way to ensure unique jobs through the use of
{{clientRequestId}}:
{quote}
The client's unique identifier of the job, re-used
across retried attempts. If this field is set, the service will ensure
its uniqueness. The request to create a job will fail if the service has
knowledge of a previously submitted job with the same client's ID and
job name. The caller may use this field to ensure idempotence of job
creation across retried attempts to create a job. By default, the field
is empty and, in that case, the service ignores it.
{quote}
[https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs]
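The semantics quoted above can be illustrated with a small in-memory model. This is only a sketch: {{FakeJobService}} and {{createJob}} are hypothetical stand-ins invented for illustration, not the Dataflow API or its client library, and a real service reports the rejection as a failed request rather than a boolean.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class IdempotentCreateSketch {

    /** Hypothetical in-memory stand-in for the service; NOT the Dataflow API. */
    static final class FakeJobService {
        // Remembers every (clientRequestId, jobName) pair it has accepted.
        private final Set<List<String>> seen = new HashSet<>();

        /** Returns true if the job is created, false if rejected as a duplicate. */
        boolean createJob(String jobName, String clientRequestId) {
            if (clientRequestId == null || clientRequestId.isEmpty()) {
                // Per the quoted description, an empty field is ignored.
                return true;
            }
            // Set.add returns false when the pair was already present.
            return seen.add(List.of(clientRequestId, jobName));
        }
    }

    public static void main(String[] args) {
        FakeJobService service = new FakeJobService();
        boolean first = service.createJob("nightly-export", "req-123");
        boolean retry = service.createJob("nightly-export", "req-123");
        System.out.println("first attempt created: " + first);
        System.out.println("retry created: " + retry);
    }
}
```

In this model the retry is rejected only because it reuses the same identifier and job name; a fresh identifier on each attempt would let both submissions through.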
In {{DataflowRunner.java}}, {{clientRequestId}} is currently set to [a randomized
value|https://github.com/apache/beam/blob/v2.37.0/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1125].
*Proposed solution*: provide the ability to pass in a {{clientRequestId}}
through {{DataflowPipelineOptions}} and set it on the {{Job}} when available,
otherwise default to the randomized value.
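A minimal sketch of the fallback logic in the proposal, written without Beam dependencies so it runs on its own. The helper name {{resolveClientRequestId}} is hypothetical, and the use of a random UUID as the default is an assumption standing in for whatever randomized value the runner generates today; how the option would actually be exposed on {{DataflowPipelineOptions}} is left open.

```java
import java.util.UUID;

public class ClientRequestIdSketch {

    /**
     * Hypothetical helper: prefer a caller-supplied ID, otherwise fall back
     * to a random one. The UUID default is an assumption for illustration.
     */
    static String resolveClientRequestId(String fromOptions) {
        if (fromOptions != null && !fromOptions.isEmpty()) {
            // Caller-supplied: stable across retried submissions.
            return fromOptions;
        }
        // Not supplied: a fresh random value on every call.
        return UUID.randomUUID().toString();
    }

    public static void main(String[] args) {
        // A caller that wants idempotence passes the same ID on every retry.
        System.out.println(resolveClientRequestId("export-2022-04-08"));
        // A caller that passes nothing gets a different value each time.
        System.out.println(resolveClientRequestId(null));
    }
}
```

With a caller-supplied value, every retry of the same logical submission carries the same identifier, which is what lets the service reject the duplicate.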