Re: Spark and Arrow Flight

David Li Mon, 01 Jul 2019 13:50:39 -0700

I think I'd prefer #3 over overloading an existing call (#2).

We've been thinking about a similar issue, where sometimes we want
just the schema, but the service can't necessarily return the schema
without fetching data - right now we return a sentinel value in
GetFlightInfo, but a separate RPC would let us explicitly indicate an
error.
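
To make the difference concrete, here is a toy in-memory sketch in plain
Python. It is not the Flight API: ToyService, SchemaUnavailable, and the
method names are made up for illustration, and None stands in for
whatever sentinel a real service returns.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


class SchemaUnavailable(Exception):
    """Hypothetical error a dedicated schema RPC could return."""


@dataclass
class FlightInfo:
    # None plays the role of the sentinel value mentioned above.
    schema: Optional[bytes]
    endpoints: List[str] = field(default_factory=list)


class ToyService:
    """Toy stand-in for a service that only knows some schemas up front."""

    def __init__(self, schemas: Dict[str, bytes]):
        # descriptor -> encoded schema obtainable without fetching data
        self._schemas = schemas

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        # Current approach: the caller must recognise the sentinel itself.
        return FlightInfo(schema=self._schemas.get(descriptor))

    def get_schema(self, descriptor: str) -> bytes:
        # Separate-RPC approach: failure is reported as an explicit error.
        try:
            return self._schemas[descriptor]
        except KeyError:
            raise SchemaUnavailable(descriptor) from None
```

With the sentinel, every caller has to remember to check for it; with
the separate call, a missing schema cannot be mistaken for a valid one.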


I might be missing something though - what happens between step 1 and
2 that makes the endpoints available? Would it make sense to use
DoAction to cause the backend to "prepare" the endpoints, and have the
result of that be an encoded schema? So then the flow would be
DoAction -> GetFlightInfo -> DoGet.
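
A rough sketch of that flow, again as a toy in-memory model rather than
real Flight calls. ToyFlightBackend, the "prepare" action name, and the
ticket format are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


class ToyFlightBackend:
    """Toy backend whose endpoints only exist after a 'prepare' action."""

    def __init__(self, tables: Dict[str, Tuple[bytes, List[list]]]):
        # descriptor -> (encoded schema, list of partitions)
        self._tables = tables
        # descriptor -> {ticket: partition}, filled in by do_action
        self._prepared: Dict[str, Dict[str, list]] = {}

    def do_action(self, action_type: str, descriptor: str) -> bytes:
        # Step 1: prepare the endpoints and return the encoded schema.
        if action_type != "prepare":
            raise ValueError(f"unknown action: {action_type}")
        schema, partitions = self._tables[descriptor]
        self._prepared[descriptor] = {
            f"{descriptor}/{i}": part for i, part in enumerate(partitions)
        }
        return schema

    def get_flight_info(self, descriptor: str) -> List[str]:
        # Step 2: endpoints (tickets here) are only known once prepared.
        if descriptor not in self._prepared:
            raise LookupError(f"{descriptor} has not been prepared")
        return list(self._prepared[descriptor])

    def do_get(self, ticket: str) -> list:
        # Step 3: fetch the data behind a single ticket.
        descriptor = ticket.rsplit("/", 1)[0]
        return self._prepared[descriptor][ticket]


def fetch_all(backend: ToyFlightBackend, descriptor: str):
    """Run DoAction -> GetFlightInfo -> DoGet and collect all rows."""
    schema = backend.do_action("prepare", descriptor)
    rows = []
    for ticket in backend.get_flight_info(descriptor):
        rows.extend(backend.do_get(ticket))
    return schema, rows
```

In a real client the DoGet calls would be issued in parallel, one per
endpoint; the loop here is sequential only to keep the sketch short.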

Best,
David

On 7/1/19, Wes McKinney <[email protected]> wrote:
> My inclination is either #2 or #3. #4 is an option of course, but I
> like the more structured solution of explicitly requesting the schema
> given a descriptor.
>
> In both cases, it's possible that schemas are sent twice, e.g. if you
> call GetSchema and then later call GetFlightInfo and so you receive
> the schema again. The schema is optional, so if it became a
> performance problem then a particular server might return the schema
> as null from GetFlightInfo.
>
> I think it's valid to want to make a single GetFlightInfo RPC request
> that returns _both_ the schema and the query plan.
>
> Thoughts from others?
>
> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <[email protected]> wrote:
>>
>> My initial inclination is towards #3 but I'd be curious what others
>> think. In the case of #3, I wonder if it makes sense to then pull the
>> Schema off the GetFlightInfo response...
>>
>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <[email protected]> wrote:
>>
>> > Hi All,
>> >
>> > I have been working on building an Arrow Flight source for Spark. The
>> > goal here is for Spark to be able to use a group of Arrow Flight
>> > endpoints to pull a dataset over to Spark in parallel.
>> >
>> > I am unsure of the best model for the Spark <-> Flight conversation
>> > and wanted to get your opinion on the best way to go.
>> >
>> > I am breaking up the query to Flight from Spark into 3 parts:
>> > 1) get the schema using GetFlightInfo. This is needed to do further
>> > lazy operations in Spark
>> > 2) get the endpoints by calling GetFlightInfo a 2nd time with a
>> > different argument. This returns the list of endpoints on the
>> > parallel Flight server. The endpoints are not available until data is
>> > ready to be fetched, which happens after the schema is fetched but
>> > must happen before DoGet is called.
>> > 3) call get stream on all endpoints from 2
>> >
>> > I think I have to do each step; however, I don't like having to call
>> > getInfo twice, as it doesn't seem very elegant. I see a few options:
>> > 1) live with calling GetFlightInfo twice, with a custom bytes cmd to
>> > differentiate the purpose of each call
>> > 2) add an argument to GetFlightInfo to tell it it's being called only
>> > for the schema
>> > 3) add another RPC endpoint, i.e. GetSchema(FlightDescriptor), to
>> > return just the Schema in question
>> > 4) use DoAction and wrap the expected FlightInfo in a Result
>> >
>> > I am aware that 4 is probably the least disruptive, but I'm also not
>> > a fan as (to me) it implies performing an action on the server side.
>> > Suggestions 2 & 3 are larger changes, and I am reluctant to make them
>> > unless there is a consensus here. None of them are great options, and
>> > I am wondering what everyone thinks the best approach might be,
>> > particularly as I think this is likely to come up in more
>> > applications than just Spark.
>> >
>> > Best,
>> > Ryan
>> >
>
