[ 
https://issues.apache.org/jira/browse/BEAM-14284?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17525899#comment-17525899
 ] 

Kenneth Knowles commented on BEAM-14284:
----------------------------------------

I am not familiar with clientRequestId, but it seems like it could be a good 
fit. It presumably originated as a way to deduplicate requests on the 
receiving side (I'm not sure exactly how they get duplicated or why the 
underlying protocol doesn't handle this, but eh). Your use case seems pretty 
similar. I'd want to get other opinions, but would you like to contribute 
this change?

> Server-side Dataflow job idempotence
> ------------------------------------
>
>                 Key: BEAM-14284
>                 URL: https://issues.apache.org/jira/browse/BEAM-14284
>             Project: Beam
>          Issue Type: Improvement
>          Components: runner-dataflow
>            Reporter: tol
>            Priority: P2
>
> *Issue*: when a job submission is retried, it may result in duplicate 
> Dataflow jobs. The Dataflow job {{name}} only guarantees uniqueness for 
> _active_ jobs -- that is, if a job with the same name exists but is already 
> completed, the same {{name}} is allowed again. What we would like is job 
> uniqueness regardless of job status.
> The Dataflow API provides a way to ensure unique jobs through the use of 
> {{clientRequestId}}:
> {quote}
> The client's unique identifier of the job, re-used across retried 
> attempts. If this field is set, the service will ensure its uniqueness. 
> The request to create a job will fail if the service has knowledge of a 
> previously submitted job with the same client's ID and job name. The 
> caller may use this field to ensure idempotence of job creation across 
> retried attempts to create a job. By default, the field is empty and, in 
> that case, the service ignores it.
> {quote}
> [https://cloud.google.com/dataflow/docs/reference/rest/v1b3/projects.locations.jobs]
> In DataflowRunner.java, {{clientRequestId}} is set with [a randomized 
> value|https://github.com/apache/beam/blob/v2.37.0/runners/google-cloud-dataflow-java/src/main/java/org/apache/beam/runners/dataflow/DataflowRunner.java#L1125].
> *Proposed solution*: provide the ability to pass in a {{clientRequestId}} 
> through {{DataflowPipelineOptions}} and set it on the {{Job}} when available, 
> otherwise default to the randomized value.
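
For illustration, the fallback described in the proposed solution could look 
roughly like the sketch below. This is only a sketch under assumptions: the 
class {{ClientRequestIds}} and the method {{resolve}} are hypothetical names 
that do not exist in Beam, and a random UUID stands in for whatever 
randomized value DataflowRunner currently generates. The real change would 
read the value from a new option on {{DataflowPipelineOptions}} and pass the 
result to the {{Job}}.

```java
import java.util.UUID;

/** Hypothetical helper for illustration only; not part of Apache Beam. */
final class ClientRequestIds {
  private ClientRequestIds() {}

  /**
   * Returns the caller-supplied ID when it is present and non-blank,
   * otherwise falls back to a freshly generated random ID.
   *
   * Assumption: a random UUID is used as the fallback here; the existing
   * runner code may build its randomized value differently.
   */
  static String resolve(String userSuppliedId) {
    if (userSuppliedId != null && !userSuppliedId.trim().isEmpty()) {
      // Reusing the same ID across retried submissions is what lets the
      // service reject a duplicate create request.
      return userSuppliedId;
    }
    return UUID.randomUUID().toString();
  }

  public static void main(String[] args) {
    // A stable, caller-chosen ID is passed through unchanged.
    System.out.println(resolve("nightly-export-2022-04-21"));
    // No ID supplied: a random one is generated for this submission.
    System.out.println(resolve(null));
  }
}
```

Keeping the randomized default means existing pipelines behave as before; 
only callers that explicitly supply an ID would get deduplication across 
retries.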



--
This message was sent by Atlassian Jira
(v8.20.7#820007)
