Hey Kenn-

This seems important, but I don't have all the context on what the problem
is.

Can you explain this sentence: "Specifically, there is pseudorandom data
generated and once it has been observed and used to produce a side effect,
it cannot be regenerated without erroneous results."?

Where is the pseudorandom data coming from? Perhaps a concrete example
would help?

S


On Tue, Mar 21, 2017 at 1:22 PM Kenneth Knowles <k...@google.com.invalid>
wrote:

> Problem:
>
> I will drop all nuance and say that the `Write` transform as it exists in
> the SDK is incorrect until we add some specification and APIs. We can't
> keep shipping an SDK with an unsafe transform in it, and IMO this certainly
> blocks a stable release.
>
> Specifically, there is pseudorandom data generated and once it has been
> observed and used to produce a side effect, it cannot be regenerated
> without erroneous results.
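[Editor's illustration, not code from the SDK: a minimal Python sketch of the unsafe pattern described above. The function names and the temp-file naming scheme are made up for the example; the point is only that a pseudorandom value escapes as a side effect, and a retry regenerates a different one.]

```python
import random

def write_bundle(records):
    """Simulates the unsafe pattern: a pseudorandom temp name is drawn,
    the write (a side effect) happens under that name, and the name is
    handed downstream for a later finalize step."""
    temp_name = f"out-{random.getrandbits(32):08x}.tmp"
    # ... imagine records are written to temp_name here (side effect) ...
    return temp_name

# First attempt: the temp name has been observed downstream.
first = write_bundle(["a", "b"])

# A retry re-executes the function; the pseudorandom name is regenerated
# and (overwhelmingly likely) differs, so the finalize step and the file
# actually written no longer agree -- the first file is orphaned.
retry = write_bundle(["a", "b"])
```

With 32 random bits the two names collide only with probability 2^-32, so the attempts effectively always disagree.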
>
> This generalizes: For some side-effecting user-defined functions, it is
> vital that even across retries/replays they have a consistent view of the
> contents of their input PCollection, because their effect on the outside
> world cannot be retracted if/when they fail and are retried. Once the
> runner ensures a consistent view of the input, it is then their own
> responsibility to be idempotent.
>
> Ideally we should specify this requirement for the user-defined function
> without imposing any particular implementation strategy on Beam runners.
>
> Proposal:
>
> 1. Let a DoFn declare (mechanism not important right now) that it "requires
> deterministic input".
>
> 2. Each runner will need a way to induce deterministic input - the obvious
> choice being a materialization.
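[Editor's sketch of what point 1 might look like, under the explicit caveat in the proposal that the mechanism is not important right now: a hypothetical marker that a runner can inspect before deciding whether to materialize the input. All names here are invented for illustration.]

```python
def requires_deterministic_input(dofn_cls):
    """Hypothetical marker: declares that the runner must present a
    stable, replayable view of this DoFn's input across retries
    (e.g. by materializing the input PCollection)."""
    dofn_cls.REQUIRES_DETERMINISTIC_INPUT = True
    return dofn_cls

@requires_deterministic_input
class FinalizeWriteFn:
    def process(self, element):
        # Safe only because the input is stable across retries; the
        # side effect itself must still be made idempotent by the user.
        pass

# A runner inspects the marker and, if set, induces determinism in
# whatever way is most efficient for it:
needs_stable = getattr(FinalizeWriteFn, "REQUIRES_DETERMINISTIC_INPUT", False)
```

This keeps the requirement declarative, so each runner is free to choose its own implementation strategy, which is exactly the separation the proposal asks for.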
>
> I want to keep the discussion focused, so I'm leaving out any possibilities
> of taking this further.
>
> Regarding performance: Today, the places that require this tend to be
> paying the cost already via GroupByKey / Reshuffle operations, since that
> was a simple way to induce determinism in batch Dataflow* (it doesn't work
> for most other runners, nor for streaming Dataflow). This change will
> replace a hard-coded implementation strategy with a requirement that may
> be fulfilled in the most efficient way available.
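[Editor's note on the Reshuffle trick referenced above: it pairs each element with a random key, groups by key, then drops the keys and re-emits the elements; in batch Dataflow it is the GroupByKey materialization that makes the downstream view stable. A rough stand-alone sketch of that dataflow shape, not the real Beam API:]

```python
import random
from collections import defaultdict

def reshuffle(elements, num_keys=16):
    """Shape of the classic Reshuffle: assign random keys, group by key
    (the step that, in batch Dataflow, checkpoints the data), then
    discard the keys and re-emit the elements unchanged."""
    grouped = defaultdict(list)
    for element in elements:
        grouped[random.randrange(num_keys)].append(element)
    out = []
    for _key, values in grouped.items():
        out.extend(values)  # keys are dropped; only elements survive
    return out
```

The elements come out in an arbitrary order but are otherwise untouched, which is why this works as a determinism-inducing no-op on the data itself while being a hard-coded (and runner-specific) implementation strategy.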
>
> Thoughts?
>
> Kenn (w/ lots of consult from colleagues, especially Ben)
>
> * There is some overlap with the reshuffle/redistribute discussion because
> of this historical situation, but I would like to leave that broader
> discussion out of this correctness issue.
>
