On Mon, Jul 1, 2019 at 3:50 PM David Li <li.david...@gmail.com> wrote:
>
> I think I'd prefer #3 over overloading an existing call (#2).
>
> We've been thinking about a similar issue, where sometimes we want
> just the schema, but the service can't necessarily return the schema
> without fetching data - right now we return a sentinel value in
> GetFlightInfo, but a separate RPC would let us explicitly indicate an
> error.
>
> I might be missing something though - what happens between step 1 and
> 2 that makes the endpoints available? Would it make sense to use
> DoAction to cause the backend to "prepare" the endpoints, and have the
> result of that be an encoded schema? So then the flow would be
> DoAction -> GetFlightInfo -> DoGet.

I think it depends on the particular server/planner implementation. If
preparing a dataset is expensive (imagine loading a large dataset into
a distributed cache, then dropping it later), then it might be that
you have:

DoAction: Load/Prepare $DATASET
... clients access the dataset using GetFlightInfo with path $DATASET
DoAction: Drop $DATASET

In other cases GetFlightInfo might contain a SQL query and so having a
separate DoAction workflow is not needed

>
> Best,
> David
>
> On 7/1/19, Wes McKinney <wesmck...@gmail.com> wrote:
> > My inclination is either #2 or #3. #4 is an option of course, but I
> > like the more structured solution of explicitly requesting the schema
> > given a descriptor.
> >
> > In both cases, it's possible that schemas are sent twice, e.g. if you
> > call GetSchema and then later call GetFlightInfo and so you receive
> > the schema again. The schema is optional, so if it became a
> > performance problem then a particular server might return the schema
> > as null from GetFlightInfo.
> >
> > I think it's valid to want to make a single GetFlightInfo RPC request
> > that returns _both_ the schema and the query plan.
> >
> > Thoughts from others?
> >
> > On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
> >>
> >> My initial inclination is towards #3 but I'd be curious what others
> >> think. In the case of #3, I wonder if it makes sense to then pull the
> >> Schema off the GetFlightInfo response...
> >>
> >> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rym...@dremio.com> wrote:
> >>
> >> > Hi All,
> >> >
> >> > I have been working on building an arrow flight source for spark.
> >> > The goal here is for Spark to be able to use a group of arrow
> >> > flight endpoints to get a dataset pulled over to spark in parallel.
> >> >
> >> > I am unsure of the best model for the spark <-> flight conversation
> >> > and wanted to get your opinion on the best way to go.
> >> >
> >> > I am breaking up the query to flight from spark into 3 parts:
> >> > 1) get the schema using GetFlightInfo. This is needed to do further
> >> >    lazy operations in Spark
> >> > 2) get the endpoints by calling GetFlightInfo a 2nd time with a
> >> >    different argument. This returns the list endpoints on the
> >> >    parallel flight server. The endpoints are not available till data
> >> >    is ready to be fetched, which is done after the schema but is
> >> >    needed before DoGet is called.
> >> > 3) call get stream on all endpoints from 2
> >> >
> >> > I think I have to do each step however I don't like having to call
> >> > getInfo twice, it doesn't seem very elegant. I see a few options:
> >> > 1) live with calling GetFlightInfo twice and with a custom bytes cmd
> >> >    to differentiate the purpose of each call
> >> > 2) add an argument to GetFlightInfo to tell it its being called only
> >> >    for the schema
> >> > 3) add another rpc endpoint: ie GetSchema(FlightDescriptor) to return
> >> >    just the Schema in question
> >> > 4) use DoAction and wrap the expected FlightInfo in a Result
> >> >
> >> > I am aware that 4 is probably the least disruptive but I'm also not
> >> > a fan as (to me) it implies performing an action on the server side.
> >> > Suggestions 2 & 3 are larger changes and I am reluctant to do that
> >> > unless there is a consensus here. None of them are great options and
> >> > I am wondering what everyone thinks the best approach might be?
> >> > Particularly as I think this is likely to come up in more
> >> > applications than just spark.
> >> >
> >> > Best,
> >> > Ryan
> >> >
> >
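
For anyone following the thread, here is a rough sketch of the call sequence being discussed: a schema-only lookup (option #3), then DoAction to prepare, GetFlightInfo for the endpoints, DoGet per endpoint, and DoAction to drop. It is a pure-Python, in-memory toy and not the Arrow Flight API: ToyFlightServer, FlightInfo, fetch_all, the "prepare"/"drop" action names, the two-way partitioning and the "path/index" ticket format are all assumptions made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass
class FlightInfo:
    """Hypothetical stand-in for a FlightInfo message."""
    schema: Optional[List[str]]  # column names only, for illustration
    endpoints: List[str]         # opaque tickets, one per partition


@dataclass
class ToyFlightServer:
    """In-memory toy that models only the order of calls."""
    datasets: Dict[str, Tuple[List[str], List[tuple]]]  # path -> (columns, rows)
    prepared: Dict[str, List[List[tuple]]] = field(default_factory=dict)

    def get_schema(self, path: str) -> List[str]:
        # Schema-only call: needs no prepared data and can fail explicitly
        # instead of returning a sentinel value.
        if path not in self.datasets:
            raise KeyError(f"unknown dataset: {path}")
        return list(self.datasets[path][0])

    def do_action(self, action: str, path: str) -> None:
        if action == "prepare":
            _, rows = self.datasets[path]
            # Assumption: two partitions, to mimic parallel endpoints.
            self.prepared[path] = [rows[0::2], rows[1::2]]
        elif action == "drop":
            self.prepared.pop(path, None)
        else:
            raise ValueError(f"unknown action: {action}")

    def get_flight_info(self, path: str) -> FlightInfo:
        # Endpoints only exist once the dataset has been prepared.
        if path not in self.prepared:
            raise RuntimeError(f"dataset not prepared: {path}")
        count = len(self.prepared[path])
        tickets = [f"{path}/{i}" for i in range(count)]
        return FlightInfo(schema=self.get_schema(path), endpoints=tickets)

    def do_get(self, ticket: str) -> List[tuple]:
        path, index = ticket.rsplit("/", 1)
        return self.prepared[path][int(index)]


def fetch_all(server: ToyFlightServer, path: str):
    schema = server.get_schema(path)           # step 1: schema only
    server.do_action("prepare", path)          # make endpoints available
    try:
        info = server.get_flight_info(path)    # step 2: endpoints
        rows = [r for t in info.endpoints for r in server.do_get(t)]  # step 3
    finally:
        server.do_action("drop", path)         # release the prepared data
    return schema, rows
```

A client such as the Spark source would call fetch_all(server, "trips") against a server holding a "trips" dataset; in a real deployment the DoGet calls would go to the individual endpoints in parallel rather than in a loop.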