There's a Runner Authoring Guide on the Beam site as well: https://beam.apache.org/contribute/runner-guide/
On 2021/08/11 16:00:32, Robert Bradshaw <[email protected]> wrote:
> You might also want to look at
> https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p
> to get an overview of the SDK <-> runner <-> worker communication
> patterns. It was written from the SDK's perspective but can be helpful
> for understanding runners as well.
>
> On Wed, Aug 11, 2021 at 8:44 AM Luke Cwik <[email protected]> wrote:
> >
> > The execution of work via the actual underlying service and API definition
> > is push based [1], since it uses a bidirectional stream between the client
> > and the server, allowing the server to "push" work to the client. The
> > implementation in the FnApiRunner is just one abstraction of how work
> > scheduling could be done and isn't really meant to be a high-performance
> > distributed "runner". Typically, new runners integrate their own work
> > scheduler directly, as Flink/Spark/Samza/Dataflow/... do.
> >
> > I took a quick pass over the integration, and it may make sense to add Ray
> > as a type of cluster/worker manager, similar to how some runners can launch
> > jobs using Kubernetes as a cluster manager with Docker containers that
> > host the workers. This would allow other runners to use Ray as a cluster
> > manager. Otherwise, you could try to abstract out the reusable parts of
> > FnApiRunner and build a distributed runner powered by Ray. There are
> > probably many improvements to FnApiRunner that would be welcome as well,
> > but turning FnApiRunner into a distributed runner is likely a stretch.
> >
> > Regardless of which direction you take, note that many pipelines rely on
> > caching keyed state across bundles (aka work items), and performance
> > suffers significantly for many pipelines due to cache misses, so you'll
> > want to design work scheduling with some affinity.
> >
> > 1: https://github.com/apache/beam/blob/b9b4c6e1f142d3fcb93f6a4458e2a06282ab9ee8/model/fn-execution/src/main/proto/beam_fn_api.proto#L76
> >
> > On Wed, Aug 11, 2021 at 1:16 AM Kai Fricke <[email protected]> wrote:
> >>
> >> Hey everyone,
> >>
> >> I'm one of the maintainers of the Ray project (https://ray.io) and have
> >> been investigating an Apache Beam runner for Ray. Our team has seen a lot
> >> of interest in this from users, and we are happy to invest significant
> >> time.
> >>
> >> I've implemented a proof-of-concept Ray-backed SDK worker [0]. It uses
> >> the existing Python Fn API Runner but replaces the
> >> multi-threading/multi-processing SDK workers with distributed Ray
> >> workers. This works well (it executes workloads on a cluster), but I'm
> >> wondering whether it can actually scale to large multi-node clusters.
> >>
> >> I also have a couple of questions about the general worker scheduling
> >> model, specifically whether we should continue with a polling-based
> >> worker model or whether we could replace it with a push-based model
> >> that uses Ray-native data processing APIs to schedule the workload.
> >>
> >> At this point it would be great to get more context from someone who is
> >> more familiar with the Beam abstractions and codebase, specifically the
> >> Python implementation. I'd love to get on a call to chat more.
> >>
> >> Thanks and best wishes,
> >> Kai
> >>
> >> [0] https://github.com/krfricke/beam/tree/ray-runner-scratch/sdks/python/apache_beam/runners/ray
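The push-based scheduling and key-affinity points from the replies above can be sketched together in plain Python. This is only an illustration, not the Beam Fn API itself: queues stand in for the gRPC bidirectional Control stream, and every name below (`sdk_worker`, `worker_for`, the dict messages) is invented for the sketch.

```python
import hashlib
import queue
import threading

# Toy stand-ins for the Fn API Control stream: in the real protocol the
# runner pushes instruction protobufs to each SDK worker over a single
# gRPC bidirectional stream; here one Queue per worker plays that role.
NUM_WORKERS = 4
request_queues = [queue.Queue() for _ in range(NUM_WORKERS)]
responses = queue.Queue()

def sdk_worker(worker_id):
    # Workers never poll a shared pool for work; they just react to
    # whatever the runner pushes down their stream.
    cache = {}  # keyed-state cache that stays warm across bundles
    while True:
        instruction = request_queues[worker_id].get()
        if instruction is None:  # stream closed
            break
        cache[instruction["key"]] = cache.get(instruction["key"], 0) + 1
        responses.put({"key": instruction["key"], "worker": worker_id})

threads = [threading.Thread(target=sdk_worker, args=(i,))
           for i in range(NUM_WORKERS)]
for t in threads:
    t.start()

def worker_for(key):
    # Stable hash so a given key always lands on the same worker, keeping
    # that worker's keyed-state cache useful across bundles (the affinity
    # point made above).
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_WORKERS

# Runner side: push keyed work items down the streams as they become ready.
for key in ["user-1", "user-2", "user-1", "user-3"]:
    request_queues[worker_for(key)].put({"key": key})
for q in request_queues:
    q.put(None)
for t in threads:
    t.join()

results = [responses.get() for _ in range(4)]
# Every occurrence of a given key was processed by the same worker.
by_key = {}
for r in results:
    by_key.setdefault(r["key"], set()).add(r["worker"])
assert all(len(workers) == 1 for workers in by_key.values())
```

A real runner would of course replace the queues with the Fn API's gRPC stream and the hash with a smarter placement policy, but the two properties Luke highlights survive the simplification: the server pushes rather than the worker polling, and routing by key keeps state caches hot.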
