Hey Kenn- this seems important, but I don't have all the context on what the problem is.
Can you explain this sentence: "Specifically, there is pseudorandom data generated and once it has been observed and used to produce a side effect, it cannot be regenerated without erroneous results."? Where is the pseudorandom data coming from? Perhaps a concrete example would help?

S

On Tue, Mar 21, 2017 at 1:22 PM Kenneth Knowles <k...@google.com.invalid> wrote:

> Problem:
>
> I will drop all nuance and say that the `Write` transform as it exists in
> the SDK is incorrect until we add some specification and APIs. We can't
> keep shipping an SDK with an unsafe transform in it, and IMO this
> certainly blocks a stable release.
>
> Specifically, there is pseudorandom data generated, and once it has been
> observed and used to produce a side effect, it cannot be regenerated
> without erroneous results.
>
> This generalizes: for some side-effecting user-defined functions, it is
> vital that they have a consistent view of the contents of their input
> PCollection even across retries/replays, because their effect on the
> outside world cannot be retracted if/when they fail and are retried. Once
> the runner ensures a consistent view of the input, it is then their own
> responsibility to be idempotent.
>
> Ideally we should specify this requirement for the user-defined function
> without imposing any particular implementation strategy on Beam runners.
>
> Proposal:
>
> 1. Let a DoFn declare (mechanism not important right now) that it
> "requires deterministic input".
>
> 2. Each runner will need a way to induce deterministic input - the
> obvious choice being a materialization.
>
> I want to keep the discussion focused, so I'm leaving out any
> possibilities of taking this further.
>
> Regarding performance: today, the places that require this tend to
> already be paying the cost via GroupByKey / Reshuffle operations, since
> that was a simple way to induce determinism in batch Dataflow* (it
> doesn't work for most other runners, nor for streaming Dataflow). This
> change will replace a hard-coded implementation strategy with a
> requirement that may be fulfilled in the most efficient way available.
>
> Thoughts?
>
> Kenn (w/ lots of consult from colleagues, especially Ben)
>
> * There is some overlap with the reshuffle/redistribute discussion
> because of this historical situation, but I would like to leave that
> broader discussion out of this correctness issue.
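For readers following along, here is a minimal sketch of the hazard being described. This is a hypothetical illustration in plain Java, not Beam's actual Write code: a processing step names its output using freshly generated pseudorandom data. If a side effect (e.g. writing a file under that name) happens and the runner then retries the element, the random data is regenerated differently, so the retry disagrees with the attempt whose effect already escaped.

```java
import java.util.UUID;

// Hypothetical sketch (not Beam's actual Write implementation): a
// processing step that derives its output destination from pseudorandom
// data generated inside the step itself.
public class UnstableNaming {

    // Simulates one processing attempt for an element: pick a random shard
    // name and return the destination a side effect would be applied to.
    static String attemptWrite(String element) {
        // Pseudorandom data: regenerated from scratch on every attempt.
        String shard = UUID.randomUUID().toString();
        return "out-" + shard + "-" + element;
    }

    public static void main(String[] args) {
        // First attempt: suppose the side effect (file creation) happens,
        // but the attempt is then deemed failed and the runner retries.
        String first = attemptWrite("record-1");

        // Retry of the same element: the random data cannot be
        // "regenerated" to match, so the retry targets a different name,
        // orphaning the first attempt's output.
        String retry = attemptWrite("record-1");

        System.out.println(first.equals(retry)); // prints "false"
    }
}
```

Under the proposal above, a step like this would declare that it requires deterministic (stable) input, and the runner would ensure each element is observed identically across retries, e.g. by materializing the input before the side-effecting step.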