Daniel Halperin created BEAM-758:
------------------------------------

             Summary: Per-step, per-execution nonce
                 Key: BEAM-758
                 URL: https://issues.apache.org/jira/browse/BEAM-758
             Project: Beam
          Issue Type: Improvement
          Components: sdk-java-core
    Affects Versions: Not applicable
            Reporter: Daniel Halperin


In the forthcoming runner API, a user will be able to save a pipeline to JSON 
and then run it repeatedly.

Many pieces of code (e.g., BigQueryIO.Read or Write) rely on a single random 
value (nonce). These values are typically generated at apply time, so that they 
are deterministic (don't change across retries of DoFns) and global (are the 
same across all workers).

However, once the runner API lands the existing code would result in the same 
nonce being reused across jobs. Other possible solutions:

* Generate nonce in {{Create(1) | ParDo}} then use this as a side input. Should 
work, as along as side inputs are actually checkpointed. But does not work for 
{{BoundedSource}}.

* If a nonce is only needed for the lifetime of one bundle, can be generated in 
{{startBundle}} and used in {{finishBundle}} [or {{tearDown}}].

* Add some context somewhere that lets user code access unique step name, and 
somehow generate a nonce consistently e.g. by hashing. Will usually work, but 
this is similarly not available to sources.

Another Q: I'm not sure we have a good way to generate nonces in unbounded 
pipelines -- we probably need one. This would enable us to, e.g., use 
{{BigQueryIO.Write}} in an unbounded pipeline [if we had, e.g., exactly-once 
triggering per window]. Or generalizing to multiple firings...



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to