On Tue, Oct 11, 2016 at 10:56 AM Eugene Kirpichov <[email protected]> wrote:
> Yeah, I'm starting to lean towards removing Redistribute.byKey() from the > public API - because it only makes sense for getting access to per-key > state, and 1) we don't have it yet and 2) runner should insert it > automatically - so there's no use case for it. +1 to removing Redistribute.byKey() from the public API. > The "checkpointing keys" use > case should be done via Redistribute.arbitrarily(), I believe. > Actually I think it really does have to be a GroupByKey followed by writing the groups without breaking them up: - With GBK each element must appear in exactly one output group, so you have to checkpoint or be able to retract groupings (nice easy explanation from Thomas Groh; any faults in my paraphrase are my own). - But GBK followed by "output the elements one by one" actually removes this property. Now you can replace the whole thing with a no-op and fuse with downstream and still get exactly once processing according to the model but not as observed via side effects to external systems. So sinks should really be doing that, and I'll retract this use case for Redistribute. As for Redistribute.arbitrarily(): > In a batch runner, we could describe it as "identity transform, but runner > is required to process the resulting PCollection with downstream transforms > as well as if it had been created from elements via Create.of(), in terms > of ability to parallelize processing and minimize amount of re-executed > work in case of failures" (which is a fancy way of saying "materialize" > without really saying "materialize" :) ). > How can you observe if the runner ignored you?
