Hi all,

Please take a look at the following document. It is of particular interest
to runner authors [I've explicitly added some people], because the approach
is not that specific to Beam and generally concerns an open problem in
efficiently applying a function in a distributed fashion (*Map* in
*MapReduce*, *ParDo* in Beam).

In the context of Beam, the main immediate problem addressed by this
document is execution of SDFs over the Fn API, in particular in a typical
streaming runner. However, I am hoping that longer-term this approach can
lead to across-the-board efficiency improvements.


*We outline an approach to represent and implement dynamic splits and
checkpoints when processing a bundle with an arbitrary instruction graph
that potentially spans SDK harness container boundaries.*

*The proposal is motivated by the need to support Splittable DoFns over
the Fn API, but ends up being more general: it unifies the treatment of
splitting regular DoFns, splittable DoFns and the (legacy)
bounded/unbounded source API, and provides more dynamic splittability than
previously afforded by either of these. In particular, it makes
construction-time fusion choices no longer be a limiting factor for dynamic

*The approach is largely complete, and fully addresses the immediately
pressing case of supporting a single SDF running in an SDK harness
controlled by a streaming runner; however, several aspects of the
general-case solution need further research.*
