There's a Runner Authoring Guide on the Beam site as well: https://beam.apache.org/contribute/runner-guide/
On 2021/08/11 16:00:32, Robert Bradshaw <[email protected]> wrote:
> You might also want to look at
> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
> to get an overview of the SDK <-> runner <-> worker communication
> patterns. It was written from the SDK's perspective but can be helpful
> for understanding runners as well.
>
> On Wed, Aug 11, 2021 at 8:44 AM Luke Cwik <[email protected]> wrote:
> >
> > The execution of work via the actual underlying service and API definition
> > is push based [1], since it uses a bidirectional stream between the client
> > and the server, allowing the server to "push" work to the client. The
> > implementation in the FnApiRunner is just one abstraction of how work
> > scheduling could be done and isn't really meant to be a high-performance
> > distributed "runner". Typically, new runners integrate their own work
> > scheduler directly, as Flink/Spark/Samza/Dataflow/... do.
> >
> > I took a quick pass over the integration, and it may make sense to add Ray
> > as a type of cluster/worker manager, similar to how some runners can launch
> > jobs using Kubernetes as a cluster manager with Docker containers that
> > host the workers. This would allow other runners to use Ray as a cluster
> > manager. Otherwise, you could try to abstract out the reusable parts of
> > FnApiRunner and build a distributed runner powered by Ray. There are
> > probably many improvements to FnApiRunner that would be welcome as well,
> > but turning FnApiRunner into a distributed runner is likely a stretch.
> >
> > Regardless of which direction you take, note that many pipelines rely on
> > caching keyed state across bundles (aka work items), and performance
> > suffers significantly for many pipelines due to cache misses, so you'll
> > want to design work scheduling with some affinity.
> >
> > 1: https://github.com/apache/beam/blob/b9b4c6e1f142d3fcb93f6a4458e2a06282ab9ee8/model/fn-execution/src/main/proto/beam_fn_api.proto#L76
> >
> > On Wed, Aug 11, 2021 at 1:16 AM Kai Fricke <[email protected]> wrote:
> >>
> >> Hey everyone,
> >>
> >> I'm one of the maintainers of the Ray project (https://ray.io) and have
> >> been investigating an Apache Beam runner for Ray. Our team has seen a lot
> >> of interest in this from users, and we are happy to invest significant
> >> time.
> >>
> >> I've implemented a proof-of-concept Ray-backed SDK worker [0]. It uses
> >> the existing Python Fn API Runner but replaces the
> >> multi-threading/multi-processing SDK workers with distributed Ray
> >> workers. This works well (it executes workloads on a cluster), but I'm
> >> wondering whether it can actually scale to large multi-node clusters.
> >>
> >> I also have a couple of questions about the general worker scheduling
> >> model, specifically whether we should continue with a polling-based
> >> worker model or whether we could replace it with a push-based model
> >> that uses Ray-native data processing APIs to schedule the workload.
> >>
> >> At this point it would be great to get more context from someone who is
> >> more familiar with the Beam abstractions and codebase, specifically the
> >> Python implementation. I'd love to get on a call to chat more.
> >>
> >> Thanks and best wishes,
> >> Kai
> >>
> >> [0] https://github.com/krfricke/beam/tree/ray-runner-scratch/sdks/python/apache_beam/runners/ray
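The push-based scheduling and key-affinity points from the replies above can be sketched together in plain Python. This is only an illustration, not the Beam Fn API itself: queues stand in for the gRPC bidirectional Control stream, and every name below (`sdk_worker`, `worker_for`, the dict messages) is invented for the sketch.

```python
import hashlib
import queue
import threading

# Toy stand-ins for the Fn API Control stream: in the real protocol the
# runner pushes instruction protobufs to each SDK worker over a single
# gRPC bidirectional stream; here one Queue per worker plays that role.
NUM_WORKERS = 4
request_queues = [queue.Queue() for _ in range(NUM_WORKERS)]
responses = queue.Queue()

def sdk_worker(worker_id):
    # Workers never poll a shared pool for work; they just react to
    # whatever the runner pushes down their stream.
    cache = {}  # keyed-state cache that stays warm across bundles
    while True:
        instruction = request_queues[worker_id].get()
        if instruction is None:  # stream closed
            break
        cache[instruction["key"]] = cache.get(instruction["key"], 0) + 1
        responses.put({"key": instruction["key"], "worker": worker_id})

threads = [threading.Thread(target=sdk_worker, args=(i,))
           for i in range(NUM_WORKERS)]
for t in threads:
    t.start()

def worker_for(key):
    # Stable hash so a given key always lands on the same worker, keeping
    # that worker's keyed-state cache useful across bundles (the affinity
    # point made above).
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_WORKERS

# Runner side: push keyed work items down the streams as they become ready.
for key in ["user-1", "user-2", "user-1", "user-3"]:
    request_queues[worker_for(key)].put({"key": key})
for q in request_queues:
    q.put(None)
for t in threads:
    t.join()

results = [responses.get() for _ in range(4)]
# Every occurrence of a given key was processed by the same worker.
by_key = {}
for r in results:
    by_key.setdefault(r["key"], set()).add(r["worker"])
assert all(len(workers) == 1 for workers in by_key.values())
```

A real runner would of course replace the queues with the Fn API's gRPC stream and the hash with a smarter placement policy, but the two properties Luke highlights survive the simplification: the server pushes rather than the worker polling, and routing by key keeps state caches hot.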
