Hey Vinay,

This Spark source might be of interest [1]. We had discussed the
possibility of it being moved into Arrow proper as a contrib module once
it is more stable.

This is doing something similar to what you are suggesting: talking to a
cluster of Flight servers from Spark. It deals more with the client side
than the server side, however. It talks to a single Flight 'coordinator'
and uses getSchema/getFlightInfo to tell the coordinator which dataset it
wants. The coordinator then returns a list of Flight tickets covering
portions of the requested dataset. A client can then a) ask for the
entire dataset from the coordinator, b) iterate serially through the
tickets and assemble the whole dataset on the client side, or c) (as the
Spark connector does) fetch tickets in parallel.
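To illustrate, here is a minimal pure-Python sketch of that client-side flow. Note this is a model of the pattern, not the real pyarrow.flight API: `Endpoint`, `get_flight_info`, and `do_get` are hypothetical stand-ins for Flight's FlightEndpoint, GetFlightInfo, and DoGet calls, and worker addresses are made up.

```python
from concurrent.futures import ThreadPoolExecutor

# Stand-in for a Flight endpoint: a ticket plus the worker location that
# serves that portion of the dataset.
class Endpoint:
    def __init__(self, ticket, location):
        self.ticket = ticket
        self.location = location

# Hypothetical coordinator: answers getFlightInfo with one endpoint per
# worker holding a slice of the requested dataset.
def get_flight_info(descriptor, workers):
    return [Endpoint(ticket=f"{descriptor}/part-{i}", location=w)
            for i, w in enumerate(workers)]

# Hypothetical per-ticket fetch; a real client would call DoGet with the
# ticket against endpoint.location and stream back Arrow record batches.
def do_get(endpoint):
    return [f"batch from {endpoint.location} for {endpoint.ticket}"]

def fetch_serial(endpoints):
    # Option b): walk the tickets one by one, assembling client-side.
    batches = []
    for ep in endpoints:
        batches.extend(do_get(ep))
    return batches

def fetch_parallel(endpoints):
    # Option c), Spark-connector style: fetch all tickets concurrently.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(do_get, endpoints))
    return [b for batches in results for b in batches]

endpoints = get_flight_info("sales", ["worker-1:47470", "worker-2:47470"])
```

Because each ticket covers a disjoint slice, serial and parallel fetches assemble the same dataset; the parallel path just overlaps the network transfers.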

I think the server side as you described above doesn't yet exist in a
standalone form, although the Spark connector was developed in conjunction
with [2] as the server. That implementation is, however, highly dependent
on the implementation details of the Dremio engine, which takes care of
the coordination between the Flight workers. The idea is identical to
yours: a coordinator engine, a distributed store for engine metadata, and
worker engines which create/serve the Arrow buffers.
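That coordinator/worker split can be sketched in a few lines of plain Python. This is only a model of the pattern under stated assumptions: a dict stands in for the shared metadata store (e.g. ZooKeeper), and the function names and locations are hypothetical, not a real API.

```python
# Shared metadata store stand-in: dataset name -> list of
# (worker_location, ticket) pairs. In a real deployment this would live
# in something like ZooKeeper so workers and coordinator share it.
registry = {}

def register_worker(dataset, worker_location):
    # A worker announces which portion of a dataset it serves, minting a
    # ticket the coordinator can later hand to clients.
    ticket = f"{dataset}@{worker_location}"
    registry.setdefault(dataset, []).append((worker_location, ticket))

def coordinator_get_flight_info(dataset):
    # The coordinator answers getFlightInfo by consulting the shared
    # store and returning one (location, ticket) pair per worker slice.
    return registry.get(dataset, [])

register_worker("events", "worker-a:47470")
register_worker("events", "worker-b:47470")
info = coordinator_get_flight_info("events")
```

The important property is that the coordinator holds no data itself; it only resolves a descriptor into the set of workers currently serving it, which is what makes the workers horizontally scalable.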

I'd be happy to discuss further if you are interested in working on this!

Best,
Ryan

[1] https://github.com/rymurr/flight-spark-source
[2] https://github.com/dremio-hub/dremio-flight-connector

On Fri, Oct 18, 2019 at 3:05 PM Vinay Kesarwani <vnkesarw...@gmail.com>
wrote:

> Hi,
>
> I am trying to establish the following architecture.
>
> My approach for Flight horizontal scaling is to launch:
> 1- An Arrow Flight server on each node
> 2- One node declared as coordinator
> 3- Publish coordinator info to a shared service [ZooKeeper]
> 4- Launch worker node --> get coordinator node info from [ZooKeeper]
> 5- Worker publishes its info to [ZooKeeper] to be consumed by others
>
> Client connects to coordinator:
> 1- Calls getFlightInfo(desc)
> 2- Here the coordinator node overrides getFlightInfo()
> 3- getFlightInfo() internally gets worker info from ZooKeeper based on
> the descriptor
> 4- Client consumes data from each endpoint in an iterative manner OR in
> parallel [not sure how]
> --> getData()
>
> PutData():
> 5- Client calls putData() to put data on different nodes in a Flight
> stream
> 6- Iterate through the endpoints and match worker node IPs
> 7- If a worker IP matches the endpoint, that worker puts the data in
> that node's Flight server.
> 8- On putting any new/updated stream, the worker node info is updated in
> ZooKeeper
> 9- In case the worker IP doesn't match the endpoint, we need to put the
> data on some other worker node and publish the info in ZooKeeper.
>
> [in future distributed-client and distributed end point] example: spark
> workers to Apache arrow flight cluster
>
> [image: architecture diagram]
> <https://user-images.githubusercontent.com/6141965/67092386-b0012c00-f1cc-11e9-9ce2-d657001a85f7.png>
>
> Just wanted to discuss whether any PR is in progress for horizontal
> scaling in Arrow Flight, or whether any design doc is under discussion.
>


-- 

Ryan Murray  | Principal Consulting Engineer

+447540852009 | rym...@dremio.com

