Re: Introducing a Redistribute transform

Amit Sela Mon, 10 Oct 2016 12:58:26 -0700

On Mon, Oct 10, 2016 at 9:21 PM Robert Bradshaw <[email protected]> wrote:

> On Sat, Oct 8, 2016 at 7:31 AM, Amit Sela <[email protected]> wrote:
> > Hi Eugene,
> >
> > This is very interesting.
> > Let me see if I get this right, the "Redistribute" transformation assigns
> > a "running id" key (per-bundle), calls "Redistribute.byKey", and extracts
> > back the values, correct?
>
> The keys are (pseudorandomly) unique per element.
>
> > As for "Redistribute.byKey" - it's made of a GroupByKey transformation that
> > follows a Window transformation that neutralises the "resolution" of
> > triggers and panes that usually occurs in GroupByKey, correct?
> >
> > So this is basically a "FanOut" transformation which will depend on the
> > available resources of the runner (and the uniqueness of the assigned keys)?
> >
> > Would we want to Redistribute into a user-defined number of bundles
> > (> current)?
>
> I don't think there's any advantage to letting the user specify a
> number here; the data is spread out among as many machines as are
> handling the shuffling (for N elements, there are ~N unique keys,
> which gets partitioned by the system to the M workers).
>
> > How about "FanIn"?
>
> Could you clarify what you would hope to use this for?
>
Well, what if, for some reason, I wanted to limit parallelism for a step
in the Pipeline? For example, calling an external service without "DDoS"ing it?
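
To make the "FanIn" question concrete, here is a minimal sketch in plain Python (not Beam code, and not anything from the PR). It assumes that capping parallelism can be approximated by assigning keys from a small fixed key space and processing each key's group sequentially. NUM_SHARDS, assign_shard, fan_in, and process_with_limit are hypothetical names used only for illustration.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 3  # hypothetical cap on concurrent calls to the external service


def assign_shard(index, num_shards=NUM_SHARDS):
    """Round-robin key assignment: only `num_shards` distinct keys exist."""
    return index % num_shards


def fan_in(elements, num_shards=NUM_SHARDS):
    """Group elements by shard key, mimicking a group-by-key on a small key space."""
    shards = defaultdict(list)
    for i, element in enumerate(elements):
        shards[assign_shard(i, num_shards)].append(element)
    return dict(shards)


def process_with_limit(elements, call_service, num_shards=NUM_SHARDS):
    """Process each shard sequentially, with at most one worker thread per shard."""
    shards = fan_in(elements, num_shards)

    def run_shard(values):
        # Elements within a shard are handled one at a time.
        return [call_service(v) for v in values]

    # The pool size equals the number of shards, so at most `num_shards`
    # calls to `call_service` can be in flight at once.
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        results = pool.map(run_shard, shards.values())
    return [r for shard_result in results for r in shard_result]
```

With ~N unique keys the work spreads as widely as the workers allow; with only NUM_SHARDS keys, the grouping itself becomes the throttle. Whether a given runner would honor that as a hard limit is an open question here, not something this sketch demonstrates.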

> > On Fri, Oct 7, 2016 at 10:49 PM Eugene Kirpichov
> > <[email protected]> wrote:
> >
> >> Hello,
> >>
> >> Heads up that https://github.com/apache/incubator-beam/pull/1036 will
> >> introduce a transform called "Redistribute", encapsulating a relatively
> >> common pattern - a "fusion break" [see
> >> https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion
> >> previously providing advice on that] - useful e.g. when you write an IO
> >> as a sequence of ParDo's: split a query into parts, read each part, and
> >> you want to prevent fusing these ParDo's because that would make the
> >> whole thing execute sequentially, and in other similar cases.
> >>
> >> The PR also uses it, as an example, in DatastoreIO and JdbcIO, both of
> >> which used to have a hand-rolled implementation of the same. The Write
> >> transform has something similar, but not quite identical, so I skipped it.
> >>
> >> This is not a model change - merely providing a common implementation of
> >> something useful that already existed but was scattered across the
> >> codebase.
> >>
> >> Redistribute also subsumes the old mostly-internal Reshuffle transform
> >> via Redistribute.byKey().
> >>
> >> I tried finding more cases in the Beam codebase that have an ad-hoc
> >> implementation of this; I did not find any, but I might have missed
> >> something. I suppose the transform will need to be advertised in
> >> documentation on best-practices for connector development; perhaps some
> >> StackOverflow answers should be updated; any other places?
>
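
The pattern discussed in this thread (pair each element with a pseudorandom key, group by key, then drop the keys) can be sketched in plain Python as below. This is an illustration under stated assumptions, not the implementation from the PR: all function names are hypothetical, and the runner's shuffle is stood in for by a simple `key % num_workers` partitioning.

```python
import random
from collections import defaultdict


def pair_with_random_key(elements, rng):
    """Attach a pseudorandom 64-bit key to every element (~unique per element)."""
    return [(rng.getrandbits(64), element) for element in elements]


def group_by_key(keyed):
    """Collect values under their key, as a group-by-key step would."""
    groups = defaultdict(list)
    for key, value in keyed:
        groups[key].append(value)
    return groups


def partition_to_workers(groups, num_workers):
    """Assumed stand-in for the shuffle: each key lands on one of M workers."""
    workers = [[] for _ in range(num_workers)]
    for key, values in groups.items():
        # Dropping the key here corresponds to extracting the values back.
        workers[key % num_workers].extend(values)
    return workers


def redistribute(elements, num_workers, seed=0):
    """Spread elements over `num_workers` lists via pseudorandom keys."""
    rng = random.Random(seed)
    keyed = pair_with_random_key(elements, rng)
    return partition_to_workers(group_by_key(keyed), num_workers)
```

For N elements there are roughly N distinct keys, so the spread across the M workers is decided by the partitioning rather than by any user-supplied number.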
