On Sat, Oct 8, 2016 at 7:31 AM, Amit Sela <[email protected]> wrote:
> Hi Eugene,
>
> This is very interesting.
> Let me see if I get this right: the "Redistribute" transformation assigns
> a "running id" key (per-bundle), calls "Redistribute.byKey", and extracts
> back the values, correct?

The keys are (pseudorandomly) unique per element.

> As for "Redistribute.byKey" - it's made of a GroupByKey transformation that
> follows a Window transformation that neutralises the "resolution" of
> triggers and panes that usually occurs in GroupByKey, correct ?
>
> So this is basically a "FanOut" transformation which will depend on the
> available resources of the runner (and the uniqueness of the assigned keys)?
>
> Would we want to Redistribute into a user-defined number of bundles
> (greater than the current number)?

I don't think there's any advantage to letting the user specify a
number here; the data is spread out among as many machines as are
handling the shuffling (for N elements, there are ~N unique keys,
which get partitioned by the system across the M workers).
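
To make that concrete, here is a toy simulation in plain Python. It is not
Beam code and not how the PR implements it: the function name
`redistribute`, the `num_workers` parameter, and the modulo partitioning are
all made up for illustration. It only sketches the idea of attaching a
pseudorandom key to each element, letting the "system" assign keys to
workers, and then dropping the keys again.

```python
import random
from collections import defaultdict


def redistribute(elements, num_workers, seed=None):
    """Toy model of a key-based redistribution (hypothetical helper)."""
    rng = random.Random(seed)

    # Step 1: pair every element with a pseudorandom 64-bit key, so that
    # N elements end up with roughly N distinct keys.
    keyed = [(rng.getrandbits(64), element) for element in elements]

    # Step 2: stand-in for the shuffle. A real runner decides how keys map
    # to workers; here we simply assume key modulo num_workers.
    per_worker = defaultdict(list)
    for key, value in keyed:
        per_worker[key % num_workers].append(value)

    # Step 3: drop the keys and return only the values held by each worker.
    return [per_worker[w] for w in range(num_workers)]


if __name__ == "__main__":
    parts = redistribute(range(1000), num_workers=4, seed=0)
    for worker, values in enumerate(parts):
        print(f"worker {worker}: {len(values)} elements")
```

Note that the caller never picks the number of keys; only the number of
workers (which in practice the runner controls) determines how the data is
spread.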

> How about "FanIn" ?

Could you clarify what you would hope to use this for?

> On Fri, Oct 7, 2016 at 10:49 PM Eugene Kirpichov
> <[email protected]> wrote:
>
>> Hello,
>>
>> Heads up that https://github.com/apache/incubator-beam/pull/1036 will
>> introduce a transform called "Redistribute", encapsulating a relatively
>> common pattern, a "fusion break" [see
>> https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
>> for earlier advice on that]. It is useful e.g. when you write an IO as a
>> sequence of ParDo's: split a query into parts, read each part, and you
>> want to prevent fusing these ParDo's because that would make the whole
>> thing execute sequentially, and in other similar cases.
>>
>> The PR also uses it, as an example, in DatastoreIO and JdbcIO, both of
>> which used to have a hand-rolled implementation of the same. The Write
>> transform has something similar, but not quite identical, so I skipped it.
>>
>> This is not a model change - merely providing a common implementation of
>> something useful that already existed but was scattered across the
>> codebase.
>>
>> Redistribute also subsumes the old mostly-internal Reshuffle transform via
>> Redistribute.byKey().
>>
>> I tried finding more cases in the Beam codebase that have an ad-hoc
>> implementation of this; I did not find any, but I might have missed
>> something. I suppose the transform will need to be advertised in
>> documentation on best-practices for connector development; perhaps some
>> StackOverflow answers should be updated; any other places?
>>
