Hi All,

I have been working on building an Arrow Flight source for Spark. The goal is for Spark to be able to use a group of Arrow Flight endpoints to pull a dataset over to Spark in parallel.
I am unsure of the best model for the Spark <-> Flight conversation and wanted to get your opinions on the best way to go. I am breaking the query from Spark to Flight into three parts:

1) Get the schema using GetFlightInfo. This is needed to do further lazy operations in Spark.
2) Get the endpoints by calling GetFlightInfo a second time with a different argument. This returns the list of endpoints on the parallel Flight server. The endpoints are not available until the data is ready to be fetched, which happens after the schema is known but before DoGet is called.
3) Call DoGet on all the endpoints from step 2.

I think each step is necessary, but I don't like having to call GetFlightInfo twice; it doesn't seem very elegant. I see a few options:

1) Live with calling GetFlightInfo twice, using a custom bytes cmd to differentiate the purpose of each call.
2) Add an argument to GetFlightInfo to tell it it is being called only for the schema.
3) Add another RPC endpoint, e.g. GetSchema(FlightDescriptor), to return just the schema in question.
4) Use DoAction and wrap the expected FlightInfo in a Result.

I am aware that 4 is probably the least disruptive, but I'm also not a fan, as (to me) it implies performing an action on the server side. Suggestions 2 and 3 are larger changes, and I am reluctant to make them unless there is consensus here. None of these are great options, and I am wondering what everyone thinks the best approach might be? Particularly as I think this is likely to come up in more applications than just Spark.

Best,
Ryan