Problem:

I will drop all nuance and say that the `Write` transform as it exists in
the SDK is incorrect until we add some specification and APIs. We can't
keep shipping an SDK with an unsafe transform in it, and IMO this certainly
blocks a stable release.

Specifically, the transform generates pseudorandom data; once that data has
been observed and used to produce a side effect, it cannot be regenerated
on retry without producing erroneous results.
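
To make the failure mode concrete, here is a minimal sketch in the Java
SDK. Everything here is illustrative (WriteToTempFileFn and the file paths
are made up, not Write's actual internals), but it has the same shape: draw
pseudorandom data, side-effect with it, then emit it downstream.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.UUID;

import org.apache.beam.sdk.transforms.DoFn;

// Illustrative only: a DoFn that draws a pseudorandom temp name,
// side-effects under that name, and emits the name to a downstream step.
class WriteToTempFileFn extends DoFn<String, String> {
  @ProcessElement
  public void processElement(ProcessContext c) throws IOException {
    // Pseudorandom data, regenerated differently on every retry.
    String tempName = "temp-" + UUID.randomUUID();

    // Side effect observed by the outside world: the file now exists.
    Files.write(Paths.get("/tmp", tempName),
        c.element().getBytes(StandardCharsets.UTF_8));

    // If the bundle fails after the write and is retried, a *new* name is
    // drawn, so the downstream consumer can record a name that does not
    // match the file(s) actually written: erroneous results.
    c.output(tempName);
  }
}
```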

This generalizes: for some side-effecting user-defined functions, it is
vital that they have a consistent view of the contents of their input
PCollection, even across retries/replays, because their effect on the
outside world cannot be retracted if/when they fail and are retried. Once
the runner ensures a consistent view of the input, it is then the
function's own responsibility to be idempotent.

Ideally we should specify this requirement for the user-defined function
without imposing any particular implementation strategy on Beam runners.

Proposal:

1. Let a DoFn declare (mechanism not important right now) that it "requires
deterministic input"; a hypothetical sketch of one possible surface follows
this list.

2. Each runner will need a way to induce deterministic input; the obvious
choice is a materialization.
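
For concreteness only, since the mechanism is deliberately left open in
(1), one possible surface in the Java SDK could look like the following;
@RequiresDeterministicInput is a hypothetical marker defined inline here,
not an existing Beam API:

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;

import org.apache.beam.sdk.transforms.DoFn;

// Hypothetical marker; stands in for whatever declaration mechanism we
// eventually pick.
@Retention(RetentionPolicy.RUNTIME)
@Target(ElementType.METHOD)
@interface RequiresDeterministicInput {}

// A side-effecting DoFn that opts in. A runner that sees the declaration
// must ensure the elements it feeds this DoFn are identical across
// retries; idempotence then remains the user's responsibility.
class FinalizeWritesFn extends DoFn<String, Void> {
  @RequiresDeterministicInput
  @ProcessElement
  public void processElement(ProcessContext c) {
    // Because the input is stable, the temp name seen here on a retry is
    // the same one seen on the failed attempt, so the rename below can be
    // made idempotent.
    renameToFinalLocation(c.element());
  }

  private void renameToFinalLocation(String tempName) {
    // Idempotent external side effect (user's responsibility), elided.
  }
}
```

An annotation is just one option; a method on DoFn or a pipeline-level
hint would serve equally well for the purposes of this discussion.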

I want to keep the discussion focused, so I'm leaving out any possibilities
of taking this further.

Regarding performance: today, the places that require this tend to already
be paying the cost via GroupByKey / Reshuffle operations, since that was a
simple way to induce determinism on batch Dataflow* (it works for neither
most other runners nor streaming Dataflow). This change would replace a
hard-coded implementation strategy with a requirement that each runner may
fulfill in the most efficient way available.
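
To make the status quo concrete, here is a minimal sketch, assuming the
Java SDK's Reshuffle.viaRandomKey() and reusing the hypothetical
WriteToTempFileFn/FinalizeWritesFn names from the sketches above:

```java
import org.apache.beam.sdk.transforms.ParDo;
import org.apache.beam.sdk.transforms.Reshuffle;
import org.apache.beam.sdk.values.PCollection;

class StabilizeViaReshuffle {
  // The status-quo workaround: a Reshuffle forces a materialization on
  // batch Dataflow, checkpointing the pseudorandom temp names so that
  // retries of the downstream side-effecting step observe identical input.
  static void finalizeStably(PCollection<String> tempNames) {
    PCollection<String> stableNames = tempNames.apply(Reshuffle.viaRandomKey());
    stableNames.apply(ParDo.of(new FinalizeWritesFn()));
  }
}
```

Under this proposal that shuffle becomes an implementation detail: a runner
with a cheaper way to provide stable input would be free to use it instead.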

Thoughts?

Kenn (w/ lots of consult from colleagues, especially Ben)

* There is some overlap with the reshuffle/redistribute discussion because
of this historical situation, but I would like to leave that broader
discussion out of this correctness issue.
