On Mon, Oct 10, 2016 at 9:21 PM Robert Bradshaw <[email protected]> wrote:
> On Sat, Oct 8, 2016 at 7:31 AM, Amit Sela <[email protected]> wrote: > > > Hi Eugene, > > > > > > This is very interesting. > > > Let me see if I get this right, the "Redistribute" transformation > assigns > > > a "running id" key (per-bundle) , calls "Redistribute.byKey", and > extracts > > > back the values, correct ? > > > > The keys are (pseudorandomly) unique per element. > > > > > As for "Redistribute.byKey" - it's made of a GroupByKey transformation > that > > > follows a Window transformation that neutralises the "resolution" of > > > triggers and panes that usually occurs in GroupByKey, correct ? > > > > > > So this is basically a "FanOut" transformation which will depend on the > > > available resources of the runner (and the uniqueness of the assigned > keys) > > > ? > > > > > > Would we want to Redistribute into a user-defined number of bundles (> > > > current) ? > > > > I don't think there's any advantage to letting the user specify a > > number here; the data is spread out among as many machines as are > > handling the shuffling (for N elements, there are ~N unique keys, > > which gets partitioned by the system to the M workers). > > > > > How about "FanIn" ? > > > > Could you clarify what you would hope to use this for? > Well, what if for some reason I would want to limit parallelism for a step in the Pipeline ? like calling an external service without "DDoS"ing it ? > > > > > On Fri, Oct 7, 2016 at 10:49 PM Eugene Kirpichov > > > <[email protected]> wrote: > > > > > >> Hello, > > >> > > >> Heads up that https://github.com/apache/incubator-beam/pull/1036 will > > >> introduce a transform called "Redistribute", encapsulating a relatively > > >> common pattern - a "fusion break" [see > > >> > > >> > https://cloud.google.com/dataflow/service/dataflow-service-desc#preventing-fusion > > >> previously > > >> providing advice on that] - useful e.g. when you write an IO as a > sequence > > >> of ParDo's: split a query into parts, read each part, and you want to > > >> prevent fusing these ParDo's because that would make the whole thing > > >> execute sequentially, and in other similar cases. > > >> > > >> The PR also uses it, as an example, in DatastoreIO and JdbcIO, both of > > >> which used to have a hand-rolled implementation of the same. The Write > > >> transform has something similar, but not quite identical, so I skipped > it. > > >> > > >> This is not a model change - merely providing a common implementation of > > >> something useful that already existed but was scattered across the > > >> codebase. > > >> > > >> Redistribute also subsumes the old mostly-internal Reshuffle transform > via > > >> Redistribute.byKey(). > > >> > > >> I tried finding more cases in the Beam codebase that have an ad-hoc > > >> implementation of this; I did not find any, but I might have missed > > >> something. I suppose the transform will need to be advertised in > > >> documentation on best-practices for connector development; perhaps some > > >> StackOverflow answers should be updated; any other places? > > >> > >
