Hi,
if I understand this proposal correctly, the motivation is actually to
reduce latency by bypassing bundle atomicity guarantees: bundles after
an "at least once" Reshuffle would be reconstructed independently of
the pre-shuffle bundling. Provided this is correct, it seems that the
behavior is slightly more general than the Reshuffle case. We already
have transforms that manipulate a specific property of a PCollection,
namely whether or not it may contain duplicates. That property is
manipulated in two ways: explicitly removing duplicates based on IDs
for sources that generate duplicates, and using @RequiresStableInput,
mostly in sinks. These techniques modify an inherent property of a
PCollection, that is, whether or not it may contain duplicates
originating from the same input element.
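For the second technique, here is a minimal sketch (mine, just for
illustration, not from the proposal) of a side-effecting sink DoFn
using the Java SDK's @RequiresStableInput; the external write itself is
only a placeholder:

    import org.apache.beam.sdk.transforms.DoFn;

    // Sink-style DoFn with side effects. @RequiresStableInput asks the
    // runner to make the input stable (e.g. by checkpointing it) so that
    // any retry replays exactly the same elements before the side effect
    // runs.
    class StableInputSinkFn extends DoFn<String, Void> {

      @RequiresStableInput
      @ProcessElement
      public void processElement(@Element String element) {
        // writeToExternalSystem(element);  // placeholder for the real side effect
      }
    }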
There are two types of duplicates: duplicate elements in _different
bundles_ (typically from at-least-once sources) and duplicates arising
from bundle reprocessing (affecting only transforms with side effects,
which is what we solve with @RequiresStableInput). The point I'm trying
to get to is: should we add these properties to PCollections ("contains
cross-bundle duplicates" vs. "does not") and PTransforms ("outputs
deduplicated elements" and "requires stable input")? That would allow us
to analyze the Pipeline DAG and provide an appropriate implementation
for Reshuffle automatically, so that a new URN or flag would not be needed.
Moreover, this might be useful for a broader range of optimizations.
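Purely as an illustration of what I mean (hypothetical, none of these
types exist in Beam today), the property tracking could look roughly
like this:

    // Hypothetical sketch only, not an existing or proposed Beam API.
    // The idea: track the duplication property through the Pipeline DAG
    // so a runner can choose the Reshuffle implementation by itself.
    final class DuplicationAnalysisSketch {

      enum DuplicationProperty {
        MAY_CONTAIN_CROSS_BUNDLE_DUPLICATES,
        NO_CROSS_BUNDLE_DUPLICATES
      }

      // One possible, simplified decision rule: the cheaper "at least
      // once" Reshuffle is acceptable if the input may already contain
      // duplicates anyway, or if no downstream transform requires
      // stable input.
      static boolean canRelaxReshuffle(
          DuplicationProperty input, boolean downstreamRequiresStableInput) {
        return input == DuplicationProperty.MAY_CONTAIN_CROSS_BUNDLE_DUPLICATES
            || !downstreamRequiresStableInput;
      }
    }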
WDYT?
Jan
On 1/30/24 23:22, Robert Burke wrote:
Is the benefit of this proposal just the bounded deviation from the
existing reshuffle?
Reshuffle's behavior is already largely dictated by arbitrary runner
choice, from simply ignoring the node, to forcing a materialization
break, to a full shuffle implementation with additional side effects.
But model-wise I don't believe it guarantees specific checkpointing or
re-execution behavior as currently specified. The proto only says it
represents the operation (without specifying the behavior, which is a
big problem).
I guess my concern here is that it implies/codifies that the existing
reshuffle has more behavior than it promises outside of the Java SDK.
"Allowing duplicates" WRT reshuffle is tricky. It feels like it mostly
allows an implementation where, for example, the inputs into the
reshuffle might be re-executed. But that's always under the runner's
discretion, and ultimately it could also prevent even getting the
intended benefit of a reshuffle (notionally, just a fusion break).
Is there even a valid way to implement the notion of a reshuffle that
leads to duplicates outside of a retry/resilience case?
-------
To be clear, I'm not against the proposal. I'm against its being
built on a non-existent foundation. If the behavior isn't already
defined, it's impossible to specify a real deviation from it.
I'm all for more specific behaviors if it means we actually clarify
what the original version is in the protos, since it's news to me
(just now, because I looked) that the Java reshuffle promises GBK-like
side effects. But that's a long-deprecated transform without a
satisfying replacement for its usage, so it may be moot.
Robert Burke
On Tue, Jan 30, 2024, 1:34 PM Kenneth Knowles <k...@apache.org> wrote:
Hi all,
Just when you thought I had squeezed all the possible interest out
of this most boring-seeming of transforms :-)
I wrote up a very quick proposal as a doc [1]. It is short enough
that I will also put the main idea and main question in this email
so you can quickly read. Best to put comments in the doc.
Main idea: add a variation of Reshuffle that allows duplicates,
aka "at least once", so that users and runners can benefit from
the efficiency gain where it is possible.
Main question: is it best as a parameter to existing reshuffle
transforms or as new URN(s)? I have proposed it as a parameter but
I think either one could work.
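Just to make the two options concrete, here is a rough sketch (all
names invented here, not necessarily what the doc ends up with):

    // Hypothetical illustration only; invented names, not the doc's proposal.
    final class ReshuffleVariantsSketch {

      // Option A: a flag carried as a parameter of the existing Reshuffle
      // transform, i.e. the same URN with an extra field in its payload.
      static final class ReshuffleWithFlag {
        boolean allowDuplicates = false;

        ReshuffleWithFlag withAllowDuplicates(boolean allow) {
          this.allowDuplicates = allow;
          return this;
        }
      }

      // Option B: a separate transform identified by a brand new URN.
      static final String HYPOTHETICAL_URN =
          "beam:transform:reshuffle_allowing_duplicates:v1";
    }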
I would love feedback on the main idea, main question, or anywhere
on the doc.
Thanks!
Kenn
[1] https://s.apache.org/beam-reshuffle-allowing-duplicates