Hi all,

Please take a look at the following document. It should be of particular interest to runner authors (I've explicitly added some people), because the approach is not that specific to Beam: more generally, it concerns the open problem of efficiently applying a function in a distributed fashion (*Map* in *MapReduce*, *ParDo* in Beam).

In the context of Beam, the main immediate problem addressed by this document is the execution of SDFs over the Fn API, in particular in a typical streaming runner. However, I am hoping that longer-term this approach can lead to across-the-board efficiency improvements.

https://s.apache.org/beam-breaking-fusion

*We outline an approach to represent and implement dynamic splits and checkpoints when processing a bundle with an arbitrary instruction graph that potentially spans SDK harness container boundaries.*

*The proposal is motivated by the need to support Splittable DoFns over the Fn API, but ends up being more general: it unifies the treatment of splitting regular DoFns, splittable DoFns and the (legacy) bounded/unbounded source API, and provides more dynamic splittability than previously afforded by any of these. In particular, construction-time fusion choices are no longer a limiting factor for dynamic parallelization.*

*The approach is largely complete and fully addresses the immediately pressing case of supporting a single SDF running in an SDK harness controlled by a streaming runner; however, several aspects of the general-case solution need further research.*