Thanks everyone for your input! We'll draft a design doc and will reach out again once we've made some progress.
> On Aug 11, 2021, at 6:00 PM, Robert Bradshaw <[email protected]> wrote:
>
> You might also want to look at https://docs.google.com/presentation/d/1Cso0XP9dmj77OD9Bd53C1M3W1sPJF0ZnA20gzb2BPhE/edit#slide=id.p to get an overview of the SDK <-> runner <-> worker communication patterns. It was written from the SDK's perspective but can be helpful for understanding runners as well.
>
> On Wed, Aug 11, 2021 at 8:44 AM Luke Cwik <[email protected]> wrote:
>>
>> The execution of work via the actual underlying service and API definition is push-based[1], since it uses a bidirectional stream between the client and the server, allowing the server to "push" work to the client. The implementation in the FnApiRunner is just one abstraction of how work scheduling could be done and isn't meant to be a high-performance distributed "runner". Typically, new runners integrate their own work scheduler directly, like Flink/Spark/Samza/Dataflow/...
>>
>> I took a quick pass over the integration, and it may make sense to add Ray as a type of cluster/worker manager, similar to how some runners can launch jobs using Kubernetes as a cluster manager with Docker containers that host the worker. This could allow other runners to use Ray as a cluster manager. Otherwise, you could try to abstract out the reusable parts of FnApiRunner and build a distributed runner powered by Ray. There are probably many improvements to FnApiRunner that would be welcome as well, but turning FnApiRunner into a distributed runner is likely a stretch.
>>
>> Regardless of what you think of the above, many pipelines rely on caching keyed state across bundles (aka work items), and performance suffers significantly for many pipelines due to cache misses, so you'll want to design work scheduling with some affinity.
>>
>> 1: https://github.com/apache/beam/blob/b9b4c6e1f142d3fcb93f6a4458e2a06282ab9ee8/model/fn-execution/src/main/proto/beam_fn_api.proto#L76
>>
>> On Wed, Aug 11, 2021 at 1:16 AM Kai Fricke <[email protected]> wrote:
>>>
>>> Hey everyone,
>>>
>>> I'm one of the maintainers of the Ray project (https://ray.io) and have been investigating an Apache Beam runner for Ray. Our team has seen a lot of interest from users for this, and we are happy to invest significant time.
>>>
>>> I've implemented a proof-of-concept Ray-backed SDK worker[0]. Here I'm using the existing Python Fn API Runner and replacing the multi-threading/multi-processing SDK workers with distributed Ray workers. This works well (it executes workloads on a cluster), but I'm wondering if this can actually scale to large multi-node clusters.
>>>
>>> Further, I have a couple of questions regarding the general worker scheduling model, specifically whether we should continue with a polling-based worker model or whether we can replace it with a push-based model where we use Ray-native data processing APIs to schedule the workload.
>>>
>>> At this point it would be great to get more context from someone who is more familiar with the Beam abstractions and codebase, specifically the Python implementation. I'd love to get on a call to chat more.
>>>
>>> Thanks and best wishes,
>>> Kai
>>>
>>> [0] https://github.com/krfricke/beam/tree/ray-runner-scratch/sdks/python/apache_beam/runners/ray
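For readers of the archive: Luke's point about scheduling with affinity (so that a worker's keyed-state cache stays warm across bundles) can be illustrated with a deterministic key-to-worker assignment. This is a hypothetical sketch, not Beam or Ray code; `worker_for_key` is an invented helper name.

```python
import hashlib

def worker_for_key(key: bytes, num_workers: int) -> int:
    """Deterministically map a state key to a worker index so that
    bundles touching the same key land on the same worker and can
    reuse its cached state instead of re-fetching it."""
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_workers

# Bundles for the same key always go to the same worker,
# so that worker's state cache stays warm.
assert worker_for_key(b"user-42", 8) == worker_for_key(b"user-42", 8)
```

A real scheduler would also need to handle worker failures and resizes (e.g. via consistent hashing), but even this fixed modulo assignment captures why affinity matters: routing the same key to different workers on each bundle turns every state access into a cache miss.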
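The polling-vs-push distinction Kai raises can be sketched with a stdlib-only toy: instead of the worker repeatedly asking "any work for me?", the control plane pushes bundles down an open stream and the worker blocks on it. This is a minimal illustration of the pattern only; the names `control_plane` and `sdk_worker` are invented, and the real Fn API uses a gRPC bidirectional stream rather than an in-process queue.

```python
import queue
import threading

def control_plane(work_stream: queue.Queue, bundles):
    # The "server" side: pushes bundles down the open stream,
    # then signals completion with a sentinel.
    for bundle in bundles:
        work_stream.put(bundle)
    work_stream.put(None)

def sdk_worker(work_stream: queue.Queue, results: list):
    # The "client" side: blocks on the stream and processes
    # whatever the control plane pushes -- no polling loop.
    while (bundle := work_stream.get()) is not None:
        results.append(f"processed:{bundle}")

stream = queue.Queue()
results = []
worker = threading.Thread(target=sdk_worker, args=(stream, results))
worker.start()
control_plane(stream, ["b1", "b2", "b3"])
worker.join()
# results == ["processed:b1", "processed:b2", "processed:b3"]
```

The design point is that the server decides when and where work runs, which is what makes affinity-aware scheduling (routing related bundles to the same worker) possible on the scheduler side.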
