[ https://issues.apache.org/jira/browse/BEAM-758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858964#comment-15858964 ]
Kenneth Knowles commented on BEAM-758:
--------------------------------------
The spirit works, insofar as the Fn API will just be shipping this as "extra
stuff" that the SDK repackages and provides to its various kinds of {{DoFn}}.
To phrase it in terms of the new {{DoFn}}, as Luke suggests, here are some
Java API options:
{code}
@ProcessElement
public void processElement(
    ProcessContext c, @JobNonce String jobNonce, @StepId String stepId)
{code}
or, with little interfaces that serve roughly the same purpose:
{code}
@ProcessElement
public void processElement(ProcessContext c, JobNonce jobNonce, StepId stepId)
{code}
The latter is probably less familiar, but has the standard benefit of being
less error-prone due to isolation of the types: two {{String}} parameters can
be swapped silently, whereas a {{JobNonce}} and a {{StepId}} cannot.
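To make the type-isolation point concrete, here is a minimal, standalone sketch. The {{JobNonce}} and {{StepId}} classes below are hypothetical wrappers written only for illustration (they are not existing Beam types), and {{describe}} merely stands in for a {{processElement}}-style method:

```java
import java.util.Objects;

// Hypothetical wrapper type: not an existing Beam class.
final class JobNonce {
    private final String value;

    JobNonce(String value) {
        this.value = Objects.requireNonNull(value, "value");
    }

    String value() {
        return value;
    }
}

// Hypothetical wrapper type: not an existing Beam class.
final class StepId {
    private final String value;

    StepId(String value) {
        this.value = Objects.requireNonNull(value, "value");
    }

    String value() {
        return value;
    }
}

final class TypedIdsDemo {
    // Stand-in for a processElement-style method; this is not a real DoFn.
    static String describe(JobNonce jobNonce, StepId stepId) {
        return stepId.value() + "@" + jobNonce.value();
    }

    public static void main(String[] args) {
        JobNonce nonce = new JobNonce("n-123");
        StepId step = new StepId("write-to-bq");
        System.out.println(describe(nonce, step));
        // describe(step, nonce) would be rejected by the compiler,
        // which is the benefit over two plain String parameters.
    }
}
```

With the annotation-based variant, both parameters are plain strings, so the equivalent swapped call would compile and only fail (if at all) at runtime.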
> Per-step, per-execution nonce
> -----------------------------
>
> Key: BEAM-758
> URL: https://issues.apache.org/jira/browse/BEAM-758
> Project: Beam
> Issue Type: Improvement
> Components: sdk-java-core
> Affects Versions: Not applicable
> Reporter: Daniel Halperin
> Assignee: Sam McVeety
>
> In the forthcoming runner API, a user will be able to save a pipeline to JSON
> and then run it repeatedly.
> Many pieces of code (e.g., BigQueryIO.Read or Write) rely on a single random
> value (nonce). These values are typically generated at apply time, so that
> they are deterministic (don't change across retries of DoFns) and global (are
> the same across all workers).
> However, once the runner API lands, the existing code would result in the same
> nonce being reused across jobs. Possible solutions:
> * Generate nonce in {{Create(1) | ParDo}} then use this as a side input.
> Should work, as long as side inputs are actually checkpointed. But does not
> work for {{BoundedSource}}.
> * If a nonce is only needed for the lifetime of one bundle, can be generated
> in {{startBundle}} and used in {{finishBundle}} [or {{tearDown}}].
> * Add some context somewhere that lets user code access unique step name, and
> somehow generate a nonce consistently e.g. by hashing. Will usually work, but
> this is similarly not available to sources.
> Another Q: I'm not sure we have a good way to generate nonces in unbounded
> pipelines -- we probably need one. This would enable us to, e.g., use
> {{BigQueryIO.Write}} in an unbounded pipeline [if we had, e.g., exactly-once
> triggering per window]. Or generalizing to multiple firings...
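As a rough illustration of the hashing idea in the third bullet of the issue description, the sketch below derives a nonce from a per-execution identifier and a unique step name. It assumes that such a per-execution identifier is available to user code, which the issue does not establish; the class {{StepNonce}}, the method {{derive}}, and the parameter names are all hypothetical, and only standard JDK APIs are used:

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper: not part of Beam.
final class StepNonce {
    // Derives a 32-character hex nonce from a per-execution id and a step name.
    // The same inputs always give the same nonce, so it is stable across retries
    // and identical on every worker. ASSUMPTION: jobId differs on each run.
    static String derive(String jobId, String stepName) {
        try {
            MessageDigest digest = MessageDigest.getInstance("SHA-256");
            update(digest, jobId);
            update(digest, stepName);
            byte[] hash = digest.digest();
            // Keep the first 16 bytes (128 bits) and render them as hex.
            StringBuilder hex = new StringBuilder();
            for (int i = 0; i < 16; i++) {
                hex.append(String.format("%02x", hash[i] & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Length-prefixes each field so ("ab", "c") and ("a", "bc") hash differently.
    private static void update(MessageDigest digest, String field) {
        byte[] bytes = field.getBytes(StandardCharsets.UTF_8);
        digest.update(ByteBuffer.allocate(4).putInt(bytes.length).array());
        digest.update(bytes);
    }
}
```

For example, {{StepNonce.derive(jobId, "BigQueryIO.Write")}} would return the same value on every worker and every retry of one run, and a different value once {{jobId}} changes. As the bullet notes, this only helps where the step name is accessible, so it would not cover sources.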