I build on the work done by Eugene et al. in Breaking the fusion
barrier[2] and propose[1] a way to support splitting of bundles (primarily
for SplittableDoFn) within the portability layer. This also builds on a lot
of past work[3, 4, 5, 6, 7] related to splitting.

Note that this proposal[1] discusses the portability API changes and
"control" flow needed. It also discusses implementation details recommended
for SDKs and runners. Interestingly, I believe there is a way to provide a
limited form of dynamic work rebalancing for all runners[8] that exist
today, which runners should be able to extend easily into a meaningful
solution. Until it is implemented and tried out, though, it is hard to say
what gains, if any, there could be.

Note that follow-up proposals/discussions about any SplittableDoFn API
changes specific to each language implementation should come from those
interested in getting SplittableDoFn working with portability. A few such
changes are needed to support backlog reporting/splitting at backlog[6]
and also bundle finalization[9].

This topic has a lot of historical context so I apologize upfront for the
complicated read, but feel free to comment on the doc or this thread.

1: https://s.apache.org/splittable-do-fn
2: https://beam.apache.org/blog/2017/08/16/splittable-do-fn.html
3: http://s.apache.org/beam-breaking-fusion
4: http://s.apache.org/textio-sdf
5: http://s.apache.org/splittable-do-fn-python-sdk
6: https://s.apache.org/beam-bundles-backlog-splitting
7: https://s.apache.org/beam-checkpoint-and-split-bundles
8: https://docs.google.com/document/d/1cKOB9ToasfYs1kLWQgffzvIbJx2Smy4svlodPRhFrk4/edit#heading=h.wkwslng744mv
9: https://s.apache.org/beam-finalizing-bundles
