I have a couple of points in addition to what Robert said. Runners are
permitted to determine bundle sizes as appropriate to their implementation,
so long as bundles are atomically committed. The contents of a PCollection
are independent of the bundling of that PCollection.
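To make that concrete: from user code, bundle boundaries are only visible
through the DoFn lifecycle. Here is a minimal sketch (a hypothetical
BatchedWriteFn against the Java SDK's annotation-based DoFn API;
flushToExternalService is a stand-in, not a real Beam call) of the
amortization Robert describes below -- buffering per-element work and
flushing once per bundle, which is safe precisely because a failed bundle
is retried in its entirety:

    import java.util.ArrayList;
    import java.util.List;
    import org.apache.beam.sdk.transforms.DoFn;

    // Hypothetical example: batches external writes per bundle. The buffer
    // is safe scratch state because a bundle either commits fully or is
    // retried fully -- no partial output is ever observed downstream.
    public class BatchedWriteFn extends DoFn<String, Void> {
      private transient List<String> batch;

      @StartBundle
      public void startBundle() {
        batch = new ArrayList<>();  // fresh per-bundle scratch state
      }

      @ProcessElement
      public void processElement(ProcessContext c) {
        batch.add(c.element());     // buffer instead of one write per element
      }

      @FinishBundle
      public void finishBundle() {
        flushToExternalService(batch);  // one write per bundle
        batch = null;
      }

      private void flushToExternalService(List<String> records) {
        // stand-in for a real client call (e.g. a bulk RPC)
      }
    }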
Runners can process all elements within their own bundles (e.g.
https://github.com/apache/beam/blob/a6810372b003adf24bdbe34ed764a63841af9b99/runners/flink/runner/src/main/java/org/apache/beam/runners/flink/translation/wrappers/streaming/DoFnOperator.java#L289),
the entire input data, or anywhere in between. (A rough sketch of what this
means operationally is at the bottom of this mail.)

On Wed, Jan 25, 2017 at 10:05 AM, Robert Bradshaw
<rober...@google.com.invalid> wrote:

> Bundles are simply the unit of commitment (retry) in the Beam SDK.
> They're not really a model concept, but do leak from the
> implementation into the API as it's not feasible to checkpoint every
> individual process call, and this allows some state/compute/... to be
> safely amortized across elements (either the results of all processed
> elements in a bundle are sent downstream, or none are and the entire
> bundle is retried).
>
> On Wed, Jan 25, 2017 at 9:36 AM, Matthew Jadczak <mn...@cam.ac.uk> wrote:
> > Hi,
> >
> > I’m a finalist CompSci student at the University of Cambridge, and for
> > my final project/dissertation I am writing an implementation of the
> > Beam SDK in Elixir [1]. Given that the Beam project is obviously still
> > very much WIP, it’s still somewhat difficult to find good conceptual
> > overviews of parts of the system, which is crucial when translating the
> > OOP architecture to something completely different. However, I have
> > found many of the design docs scattered around the JIRA and here very
> > helpful. (Incidentally, perhaps it would be helpful to maintain a list
> > of them, to help any contributors acquaint themselves with the
> > conceptual vision of the implementation?)
> >
> > One thing which I have not yet been able to work out is the
> > significance of “bundles” in the SDK. On the one hand, it seems that
> > they are simply an implementation detail, effectively a way to do
> > micro-batch processing efficiently, and indeed they are not mentioned
> > at all in the original Dataflow paper or anywhere in the Beam docs
> > (except in passing). On the other hand, it seems most of the key
> > transforms in the SDK core have a concept of bundles and operate in
> > their terms in practice, while all are conceptually described as just
> > operating on elements.
> >
> > Do bundles have semantic meaning in the Beam Model? Are there any
> > guidelines as to how a given transform should split its output up into
> > bundles? Should any runner/SDK implementing the Model have that
> > concept, even when other primitives for streaming data processing,
> > including things like efficiently transmitting individual elements
> > between stages with backpressure, are available in the
> > language/standard libraries? Are there any insights here that I am
> > missing, i.e. were problems present in early versions of the runners
> > solved by adding the concept of bundles?
> >
> > Thanks so much,
> > Matt
> >
> > [1] http://elixir-lang.org/
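P.S. The runner-side sketch promised above. Every type and method here is
hypothetical (this is not the actual runner API); it only illustrates the
semantics: a bundle is an arbitrary runner-chosen slice of the input, its
outputs are buffered, and they are either committed atomically or discarded
and the whole bundle retried.

    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of a runner's bundle loop; not real Beam code.
    interface BundleProcessor<InT, OutT> {
      void startBundle();
      List<OutT> processElement(InT element);
      void finishBundle();
    }

    class BundleLoop {
      // `bundle` may hold one element, the whole input, or anything in
      // between; the PCollection's contents don't depend on the choice.
      static <InT, OutT> void runBundle(BundleProcessor<InT, OutT> fn,
                                        List<InT> bundle) {
        while (true) {  // a real runner bounds retries / fails the pipeline
          List<OutT> outputs = new ArrayList<>();
          try {
            fn.startBundle();
            for (InT element : bundle) {
              outputs.addAll(fn.processElement(element));  // buffered only
            }
            fn.finishBundle();
            commitAtomically(outputs);  // all of the bundle's output at once
            return;
          } catch (Exception e) {
            // nothing was committed, so nothing leaked downstream;
            // drop the buffered outputs and retry the entire bundle
          }
        }
      }

      static <OutT> void commitAtomically(List<OutT> outputs) {
        // stand-in: make outputs visible to downstream stages in one step
      }
    }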