Re: Introducing a Redistribute transform

Kenneth Knowles Tue, 11 Oct 2016 11:35:29 -0700

On Tue, Oct 11, 2016 at 10:56 AM Eugene Kirpichov
<[email protected]> wrote:


> Yeah, I'm starting to lean towards removing Redistribute.byKey() from the
> public API - because it only makes sense for getting access to per-key
> state, and 1) we don't have it yet and 2) runner should insert it
> automatically - so there's no use case for it.


+1 to removing Redistribute.byKey() from the public API.


> The "checkpointing keys" use
> case should be done via Redistribute.arbitrarily(), I believe.
>

Actually I think it really does have to be a GroupByKey followed by writing
the groups without breaking them up:

 - With GBK each element must appear in exactly one output group, so you
have to checkpoint or be able to retract groupings (nice easy explanation
from Thomas Groh; any faults in my paraphrase are my own).

 - But GBK followed by "output the elements one by one" actually removes
this property. Now you can replace the whole thing with a no-op and fuse
with downstream and still get exactly-once processing according to the
model, but not as observed via side effects to external systems. So sinks
should really be doing the GBK and writing whole groups themselves, and
I'll retract this use case for Redistribute.
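
To make the side-effect point concrete, here is a toy model in plain
Python. It is not Beam code and uses no Beam API; assign_ids, run, and the
checkpoint flag are hypothetical names made up for this sketch. It assumes
an upstream step that is non-deterministic (it hands out fresh ids each
time it executes) and an external system that deduplicates only by id.

```python
import itertools


def assign_ids(records, id_source):
    # Non-deterministic upstream step: each execution draws fresh ids.
    return [(next(id_source), record) for record in records]


def run(records, checkpoint):
    """Simulate one failed attempt plus one retry of a write to a sink.

    checkpoint=True models a stable, materialized input to the write.
    checkpoint=False models the fused case where a retry re-executes
    the upstream step as well.
    """
    id_source = itertools.count()
    external = {}  # external system; writes are idempotent per id only
    materialized = assign_ids(records, id_source) if checkpoint else None

    for _attempt in range(2):  # attempt 0 fails after writing, 1 is the retry
        if checkpoint:
            batch = materialized
        else:
            batch = assign_ids(records, id_source)
        for record_id, record in batch:
            external[record_id] = record  # side effect outlives the failure
    return external


stable = run(["a", "b", "c"], checkpoint=True)
fused = run(["a", "b", "c"], checkpoint=False)
```

In both cases the retry alone yields three records, so the result is
exactly-once as far as the model is concerned. Only the external system can
tell the difference: in the fused case it ends up holding the writes from
the failed attempt as well.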

As for Redistribute.arbitrarily():
> In a batch runner, we could describe it as "identity transform, but runner
> is required to process the resulting PCollection with downstream transforms
> as well as if it had been created from elements via Create.of(), in terms
> of ability to parallelize processing and minimize amount of re-executed
> work in case of failures" (which is a fancy way of saying "materialize"
> without really saying "materialize" :) ).
>

How can you observe if the runner ignored you?
