Good points & questions. I'll try to be more clear.
> On 21 March 2017 at 13:52, Stephen Sisk <s...@google.com.invalid> wrote:
> > Hey Kenn-
> >
> > this seems important, but I don't have all the context on what the
> > problem is.
> >
> > Can you explain this sentence "Specifically, there is pseudorandom data
> > generated and once it has been observed and used to produce a side
> > effect, it cannot be regenerated without erroneous results." ?

On Tue, Mar 21, 2017 at 2:04 PM, vikas rk <vikky...@gmail.com> wrote:
> For the Write transform I believe you are talking about ApplyShardingKey
> <https://github.com/apache/beam/blob/d66029cafde152c0a46ebd276ddfa4c3e7fd3433/sdks/java/core/src/main/java/org/apache/beam/sdk/io/Write.java#L304>
> which introduces non deterministic behavior when retried?

Yes, exactly this. If the sharding key changes, then the rest of the
transform doesn't function correctly.

> Where is the pseudorandom data coming from? Perhaps a concrete example
> would help?

I think the Write transform is a particularly complex example because of
the layers of abstraction. A simplified strawman might be:

Transform 1: Build RPC write descriptors identified by pseudo-random UUIDs.

Transform 2: Issue RPCs with those identifiers, so the endpoint will ignore
repeats of the same UUID (I tend to call this an "idempotency key", so I
might slip into that terminology sometimes).

In this case, Transform 2 requires deterministic input: if the write fails
and is retried, a new UUID means the endpoint won't know it is a retry,
resulting in duplicate data.

Is this clearer?

Kenn
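
P.S. In case it helps to see the strawman as code, here is a rough sketch of
the same shape using Beam's Java SDK. The names (RpcClient,
AssignIdempotencyKey, IssueIdempotentWrite) are made up for illustration;
this is not the actual Write transform, just the pattern it relies on.

import java.io.Serializable;
import java.util.UUID;
import org.apache.beam.sdk.transforms.DoFn;
import org.apache.beam.sdk.values.KV;

/** Stand-in for the endpoint client; the endpoint ignores repeats of a key. */
interface RpcClient extends Serializable {
  void write(String idempotencyKey, String payload);
}

/** Transform 1: tag each record with a pseudo-random idempotency key. */
class AssignIdempotencyKey extends DoFn<String, KV<String, String>> {
  @ProcessElement
  public void processElement(ProcessContext c) {
    // If this step is re-executed after a downstream failure, the same
    // record gets a *different* UUID, so the endpoint can no longer tell
    // that the retried write is a repeat.
    c.output(KV.of(UUID.randomUUID().toString(), c.element()));
  }
}

/** Transform 2: issue the RPC; the endpoint dedupes on the key. */
class IssueIdempotentWrite extends DoFn<KV<String, String>, Void> {
  private final RpcClient client;

  IssueIdempotentWrite(RpcClient client) {
    this.client = client;
  }

  @ProcessElement
  public void processElement(ProcessContext c) {
    // Correct behavior depends on Transform 1's output being stable
    // across retries of this step.
    client.write(c.element().getKey(), c.element().getValue());
  }
}

If a retry of Transform 2 also re-runs Transform 1 to regenerate its input,
the fresh UUIDs defeat the deduplication and the endpoint stores duplicates.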