Re: Spark and Arrow Flight

Wes McKinney Mon, 01 Jul 2019 14:04:06 -0700

On Mon, Jul 1, 2019 at 3:50 PM David Li <[email protected]> wrote:
>
> I think I'd prefer #3 over overloading an existing call (#2).
>
> We've been thinking about a similar issue, where sometimes we want
> just the schema, but the service can't necessarily return the schema
> without fetching data - right now we return a sentinel value in
> GetFlightInfo, but a separate RPC would let us explicitly indicate an
> error.
>
> I might be missing something though - what happens between step 1 and
> 2 that makes the endpoints available? Would it make sense to use
> DoAction to cause the backend to "prepare" the endpoints, and have the
> result of that be an encoded schema? So then the flow would be
> DoAction -> GetFlightInfo -> DoGet.


I think it depends on the particular server/planner implementation. If
preparing a dataset is expensive (imagine loading a large dataset into
a distributed cache, then dropping it later), then it might be that
you have:

DoAction: Load/Prepare $DATASET

... clients access the dataset using GetFlightInfo with path $DATASET

DoAction: Drop $DATASET

In other cases, the GetFlightInfo request might contain a SQL query, so
a separate DoAction workflow is not needed.
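
As a rough illustration of the load / access / drop lifecycle above, here is a toy in-memory model in Python. It is not the actual Flight API: the ToyFlightServer class, the "load" and "drop" action names, the tuple tickets, and the "sales" dataset are all made up for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class ToyFlightServer:
    """In-memory stand-in for a Flight service (hypothetical, not pyarrow.flight)."""

    # Datasets currently loaded into the pretend distributed cache.
    prepared: dict = field(default_factory=dict)

    def do_action(self, action_type, dataset):
        # "load" and "drop" are action names invented for this sketch.
        if action_type == "load":
            # Pretend the expensive load produces two partitions of rows.
            self.prepared[dataset] = [[1, 2, 3], [4, 5]]
            return b"loaded"
        if action_type == "drop":
            self.prepared.pop(dataset, None)
            return b"dropped"
        raise ValueError(f"unknown action: {action_type}")

    def get_flight_info(self, path):
        # Endpoints exist only while the dataset is prepared.
        if path not in self.prepared:
            raise KeyError(f"dataset not prepared: {path}")
        # One ticket per partition; a ticket here is just (path, index).
        return [(path, i) for i in range(len(self.prepared[path]))]

    def do_get(self, ticket):
        path, partition = ticket
        return self.prepared[path][partition]


server = ToyFlightServer()
server.do_action("load", "sales")          # DoAction: Load/Prepare $DATASET
tickets = server.get_flight_info("sales")  # GetFlightInfo with path $DATASET
rows = [row for t in tickets for row in server.do_get(t)]  # DoGet per endpoint
server.do_action("drop", "sales")          # DoAction: Drop $DATASET
```

In this model, a GetFlightInfo call made before the load or after the drop fails, which is the kind of explicit error a server could surface when the dataset has not been prepared.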

>
> Best,
> David
>
> On 7/1/19, Wes McKinney <[email protected]> wrote:
> > My inclination is either #2 or #3. #4 is an option of course, but I
> > like the more structured solution of explicitly requesting the schema
> > given a descriptor.
> >
> > In both cases, it's possible that schemas are sent twice, e.g. if you
> > call GetSchema and then later call GetFlightInfo and so you receive
> > the schema again. The schema is optional, so if it became a
> > performance problem then a particular server might return the schema
> > as null from GetFlightInfo.
> >
> > I think it's valid to want to make a single GetFlightInfo RPC request
> > that returns _both_ the schema and the query plan.
> >
> > Thoughts from others?
> >
> > On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <[email protected]> wrote:
> >>
> >> My initial inclination is towards #3 but I'd be curious what others
> >> think. In the case of #3, I wonder if it makes sense to then pull the
> >> Schema off the GetFlightInfo response...
> >>
> >> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <[email protected]> wrote:
> >>
> >> > Hi All,
> >> >
> >> > I have been working on building an arrow flight source for spark.
> >> > The goal here is for Spark to be able to use a group of arrow flight
> >> > endpoints to get a dataset pulled over to spark in parallel.
> >> >
> >> > I am unsure of the best model for the spark <-> flight conversation and
> >> > wanted to get your opinion on the best way to go.
> >> >
> >> > I am breaking up the query to flight from spark into 3 parts:
> >> > 1) get the schema using GetFlightInfo. This is needed to do further
> >> > lazy operations in Spark
> >> > 2) get the endpoints by calling GetFlightInfo a 2nd time with a
> >> > different argument. This returns the list of endpoints on the
> >> > parallel flight server. The endpoints are not available till data is
> >> > ready to be fetched, which is done after the schema but is needed
> >> > before DoGet is called.
> >> > 3) call get stream on all endpoints from 2
> >> >
> >> > I think I have to do each step; however, I don't like having to call
> >> > getInfo twice, it doesn't seem very elegant. I see a few options:
> >> > 1) live with calling GetFlightInfo twice and with a custom bytes cmd to
> >> > differentiate the purpose of each call
> >> > 2) add an argument to GetFlightInfo to tell it it's being called only
> >> > for the schema
> >> > 3) add another rpc endpoint, i.e. GetSchema(FlightDescriptor), to
> >> > return just the Schema in question
> >> > 4) use DoAction and wrap the expected FlightInfo in a Result
> >> >
> >> > I am aware that 4 is probably the least disruptive but I'm also not a
> >> > fan as (to me) it implies performing an action on the server side.
> >> > Suggestions 2 & 3 are larger changes and I am reluctant to do that
> >> > unless there is a consensus here. None of them are great options and I
> >> > am wondering what everyone thinks the best approach might be?
> >> > Particularly as I think this is likely to come up in more applications
> >> > than just spark.
> >> >
> >> > Best,
> >> > Ryan
> >> >
> >
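
For reference, option #3 from the quoted thread (a separate schema-only RPC, with the schema left optional in the GetFlightInfo response) could be sketched roughly as follows. This is a toy Python model rather than the Flight protocol itself: ToyPlanner, its send_schema_in_info flag, the FlightInfo tuple, the schema contents, and the endpoint names are all assumptions made for illustration.

```python
from typing import NamedTuple, Optional


class FlightInfo(NamedTuple):
    # The schema is optional here, mirroring the point made in the thread.
    schema: Optional[list]
    endpoints: list


class ToyPlanner:
    """Hypothetical server-side planner; not part of any real library."""

    SCHEMA = ["id: int64", "name: utf8"]  # made-up schema for the sketch

    def __init__(self, send_schema_in_info=True):
        self.send_schema_in_info = send_schema_in_info

    def get_schema(self, descriptor):
        # Option #3: return only the schema, without planning any endpoints.
        return list(self.SCHEMA)

    def get_flight_info(self, descriptor):
        # A server worried about sending the schema twice can return None.
        schema = list(self.SCHEMA) if self.send_schema_in_info else None
        endpoints = [f"{descriptor}/part-{i}" for i in range(3)]
        return FlightInfo(schema, endpoints)


planner = ToyPlanner(send_schema_in_info=False)
schema = planner.get_schema("my_table")     # step 1: schema only
info = planner.get_flight_info("my_table")  # step 2: endpoints for DoGet
```

With send_schema_in_info left at its default, a single get_flight_info call returns both the schema and the endpoints, which covers the case where one request for both is preferred.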
