On Tue, Jan 30, 2024 at 5:22 PM Robert Burke <rob...@frantil.com> wrote:

> Is the benefit of this proposal just the bounded deviation from the
> existing reshuffle?
>
> Reshuffle is already rather dictated by arbitrary runner choice, from
> simply ignoring the node, to forcing a materialization break, to a full
> shuffle implementation which has additional side effects.
>
> But model wise I don't believe it guarantees specific checkpointing or
> re-execution behavior as currently specified. The proto only says it
> represents the operation (without specifying the behavior, which is a big
> problem).
>

Indeed, the semantics are specified for reshuffle: the output PCollection
has the same elements as the input PCollection. Beam very deliberately
doesn't define operational characteristics. It is entirely possible that
reshuffle is meaningless for a runner, indeed. I'm not particularly trying
to re-open that can of worms here...

> I guess my concern here is that it implies/codifies that the existing
> reshuffle has more behavior than it promises outside of the Java SDK.
>
> "Allowing duplicates" WRT reshuffle is tricky. It feels like it mostly allows
> an implementation in which the inputs into the reshuffle might be
> re-executed, for example. But that's always at the runner's discretion,
> and ultimately it could also prevent even getting the intended benefit of a
> reshuffle (notionally, just a fusion break).
>

My intent is to be exactly as questionable as the current reshuffle, which
is indeed questionable. The semantics of the newly proposed transform are
that the output PCollection contains the same elements as the input
PCollection, possibly with duplicates. Aka the input is a subset of the
output.

> Is there even a valid way to implement the notion of a reshuffle that leads
> to duplicates outside of a retry/resilience case?
>

Sure! ParDo(x -> { output(x); output(x) })

:-) :-) :-)
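
To make the "input is a subset of the output" semantics concrete, here is a minimal sketch in plain Python (no Beam dependency). It assumes the reading that every input element must appear in the output at least as often as in the input, and that no new elements may appear. The names `allows_duplicates_ok` and `emit_twice` are hypothetical, chosen only for illustration; `emit_twice` is a stand-in for the ParDo above, not Beam API.

```python
from collections import Counter


def allows_duplicates_ok(inputs, outputs):
    """Check the assumed "same elements, possibly with duplicates" contract.

    True when every input element occurs in outputs at least as many times
    as in inputs, and outputs contains no element absent from inputs.
    """
    in_counts, out_counts = Counter(inputs), Counter(outputs)
    no_losses = all(out_counts[e] >= n for e, n in in_counts.items())
    no_new = set(out_counts) <= set(in_counts)
    return no_losses and no_new


def emit_twice(elements):
    """Plain-Python stand-in for ParDo(x -> { output(x); output(x) })."""
    for x in elements:
        yield x
        yield x


data = ["a", "b", "b"]
doubled = list(emit_twice(data))

print(allows_duplicates_ok(data, doubled))     # doubling keeps every element
print(allows_duplicates_ok(data, ["a", "b"]))  # one "b" was dropped
```

Under this reading, the doubling step satisfies the contract, while any output that drops an element or invents a new one does not.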

Kenn


>
> -------
>
> To be clear, I'm not against the proposal. I'm against its being
> built on a non-existent foundation. If the behavior isn't already defined,
> it's impossible to specify a real deviation from it.
>
> I'm all for more specific behaviors if it means we actually clarify what the
> original version is in the protos, since it's news to me (just now, because
> I looked) that the Java reshuffle promises GBK-like side effects. But
> that's a long deprecated transform without a satisfying replacement for
> its usage, so it may be moot.
>
> Robert Burke
>
>
>
> On Tue, Jan 30, 2024, 1:34 PM Kenneth Knowles <k...@apache.org> wrote:
>
>> Hi all,
>>
>> Just when you thought I had squeezed all the possible interest out of
>> this most boring-seeming of transforms :-)
>>
>> I wrote up a very quick proposal as a doc [1]. It is short enough that I
>> will also put the main idea and main question in this email so you can
>> quickly read. Best to put comments in the doc.
>>
>> Main idea: add a variation of Reshuffle that allows duplicates, aka "at
>> least once", so that users and runners can benefit from efficiency if it is
>> possible.
>>
>> Main question: is it best as a parameter to existing reshuffle transforms
>> or as new URN(s)? I have proposed it as a parameter but I think either one
>> could work.
>>
>> I would love feedback on the main idea, main question, or anywhere on the
>> doc.
>>
>> Thanks!
>>
>> Kenn
>>
>> [1] https://s.apache.org/beam-reshuffle-allowing-duplicates
>>
>
