Thanks everyone for your input! We'll draft a design doc and will reach out again once we've made some progress.
> On Aug 11, 2021, at 6:00 PM, Robert Bradshaw <[email protected]> wrote:
>
> You might also want to look at https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p to get an overview of the SDK <-> runner <-> worker communication patterns. It was written from the SDK's perspective but can be helpful for understanding runners as well.
>
> On Wed, Aug 11, 2021 at 8:44 AM Luke Cwik <[email protected]> wrote:
>>
>> The execution of work via the actual underlying service and API definition is push-based[1], since it uses a bidirectional stream between the client and the server, allowing the server to "push" work to the client. The implementation in the FnApiRunner is just one abstraction of how work scheduling could be done and isn't meant to be a high-performance distributed "runner". Typically, new runners integrate their own work scheduler directly, like Flink/Spark/Samza/Dataflow/...
>>
>> I took a quick pass over the integration, and it may make sense to add Ray as a type of cluster/worker manager, similar to how some runners can launch jobs using Kubernetes as a cluster manager with Docker containers that host the worker. This could allow other runners to use Ray as a cluster manager. Otherwise, you could try to abstract out the reusable parts of FnApiRunner and build a distributed runner powered by Ray. There are probably many improvements to FnApiRunner that would be welcome as well, but turning FnApiRunner into a distributed runner is likely a stretch.
>>
>> Regardless of what you think of the above, many pipelines rely on caching keyed state across bundles (aka work items), and performance suffers significantly for many pipelines due to cache misses, so you'll want to design work scheduling with some affinity.
>>
>> 1: https://github.com/apache/beam/blob/b9b4c6e1f142d3fcb93f6a4458e2a06282ab9ee8/model/fn-execution/src/main/proto/beam_fn_api.proto#L76
>>
>> On Wed, Aug 11, 2021 at 1:16 AM Kai Fricke <[email protected]> wrote:
>>>
>>> Hey everyone,
>>>
>>> I'm one of the maintainers of the Ray project (https://ray.io) and have been investigating an Apache Beam runner for Ray. Our team has seen a lot of interest from users for this, and we are happy to invest significant time.
>>>
>>> I've implemented a proof-of-concept Ray-backed SDK worker[0]. Here I'm using the existing Python Fn API Runner and replacing the multi-threading/multi-processing SDK workers with distributed Ray workers. This works well (it executes workloads on a cluster), but I'm wondering if this can actually scale to large multi-node clusters.
>>>
>>> Further, I have a couple of questions regarding the general worker scheduling model, specifically whether we should continue with a polling-based worker model or whether we can replace it with a push-based model where we use Ray-native data processing APIs to schedule the workload.
>>>
>>> At this point it would be great to get more context from someone who is more familiar with the Beam abstractions and codebase, specifically the Python implementation. I'd love to get on a call to chat more.
>>>
>>> Thanks and best wishes,
>>> Kai
>>>
>>> [0] https://github.com/krfricke/beam/tree/ray-runner-scratch/sdks/python/apache_beam/runners/ray
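For readers of the archive: Luke's point about scheduling with affinity (so that a worker's keyed-state cache stays warm across bundles) can be illustrated with a deterministic key-to-worker assignment. This is a hypothetical sketch, not Beam or Ray code; `worker_for_key` is an invented helper name.

```python
import hashlib

def worker_for_key(key: bytes, num_workers: int) -> int:
    """Deterministically map a state key to a worker index so that
    bundles touching the same key land on the same worker and can
    reuse its cached state instead of re-fetching it."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

# Bundles for the same key always go to the same worker,
# so that worker's state cache stays warm.
assert worker_for_key(b"user-42", 8) == worker_for_key(b"user-42", 8)
```

A real scheduler would also need to handle worker failures and resizes (e.g. via consistent hashing), but even this fixed modulo assignment captures why affinity matters: routing the same key to different workers on each bundle turns every state access into a cache miss.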
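The polling-vs-push distinction Kai raises can be sketched with a stdlib-only toy: instead of the worker repeatedly asking "any work for me?", the control plane pushes bundles down an open stream and the worker blocks on it. This is a minimal illustration of the pattern only; the names `control_plane` and `sdk_worker` are invented, and the real Fn API uses a gRPC bidirectional stream rather than an in-process queue.

```python
import queue
import threading

def control_plane(work_stream: queue.Queue, bundles):
    # The "server" side: pushes bundles down the open stream,
    # then signals completion with a sentinel.
    for bundle in bundles:
        work_stream.put(bundle)
    work_stream.put(None)

def sdk_worker(work_stream: queue.Queue, results: list):
    # The "client" side: blocks on the stream and processes
    # whatever the control plane pushes -- no polling loop.
    while (bundle := work_stream.get()) is not None:
        results.append(f"processed:{bundle}")

stream = queue.Queue()
results = []
worker = threading.Thread(target=sdk_worker, args=(stream, results))
worker.start()
control_plane(stream, ["b1", "b2", "b3"])
worker.join()
# results == ["processed:b1", "processed:b2", "processed:b3"]
```

The design point is that the server decides when and where work runs, which is what makes affinity-aware scheduling (routing related bundles to the same worker) possible on the scheduler side.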
