Of course, it might make just as much sense in Apache Spark. Probably worth bringing up with that community, too.
On Wed, Jul 10, 2019 at 12:37 PM Wes McKinney <wesmck...@gmail.com> wrote:
>
> hi Ryan -- I was thinking that this might be built separately from the
> main Java project. We don't have a model in the codebase yet for
> libraries that depend on the core libraries (this could be in an apps/
> directory at the top level, so apps/spark-flight-source or something).
> So the development procedure would be to build and install the Arrow
> libraries first and then build the Spark-Flight source as a follow up.
>
> I think there would be a lot of benefit to maintaining common
> development infrastructure -- for example, we could set up
> docker-compose tasks to spin up nodes to simulate a distributed system
> for testing and benchmarking purposes, and utilize common CI systems.
>
> - Wes
>
> On Wed, Jul 10, 2019 at 12:28 PM Ryan Murray <rym...@dremio.com> wrote:
> >
> > Hey Wes,
> >
> > Would be happy to! Jacques and I had originally thought to try and get it
> > into Spark but perhaps Arrow might be a better home. I think the only issue
> > is whether we want to bring Spark jars and their dependencies into Arrow.
> > One challenge I have had so far with the connector is managing the
> > transitive arrow dependencies from Spark, the connector only works on
> > relatively recent versions of Spark and potentially can create circular
> > arrow dependencies. I think this issue will be better once 1.0.0 is done
> > and we can rely on a stable format/api.
> >
> > Best,
> > Ryan
> >
> > On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney <wesmck...@gmail.com> wrote:
> >
> > > Hi Ryan, have you thought about developing this inside Apache Arrow?
> > >
> > > On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <cutl...@gmail.com> wrote:
> > >
> > > > Great, thanks Ryan! I'll take a look
> > > >
> > > > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <rym...@dremio.com> wrote:
> > > >
> > > > > Hi Bryan,
> > > > >
> > > > > I have an implementation of option #3 nearly ready for a PR.
> > > > > I will mention you when I publish it.
> > > > >
> > > > > The working prototype for the Spark connector is here:
> > > > > https://github.com/rymurr/flight-spark-source. It technically works
> > > > > (and is very fast!) however the implementation is pretty dodgy and
> > > > > needs to be cleaned up before ready for prime time. I plan to have it
> > > > > ready to go for the Arrow 1.0.0 release as an Apache 2.0 licensed
> > > > > project. Please shout if you have any comments or are interested in
> > > > > contributing!
> > > > >
> > > > > Best,
> > > > > Ryan
> > > > >
> > > > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <cutl...@gmail.com> wrote:
> > > > >
> > > > > > I'm in favor of option #3 also, but not sure what the best thing to
> > > > > > do with the existing FlightInfo response is. I'm definitely
> > > > > > interested in connecting Spark with Flight, can you share more
> > > > > > details of your work or is it planned to be open sourced?
> > > > > >
> > > > > > Thanks,
> > > > > > Bryan
> > > > > >
> > > > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <anto...@python.org> wrote:
> > > > > > >
> > > > > > > Either #3 or #4 for me. If #3, the default GetSchema implementation
> > > > > > > can rely on calling GetFlightInfo.
> > > > > > >
> > > > > > > On 01/07/2019 at 22:50, David Li wrote:
> > > > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > > > > >
> > > > > > > > We've been thinking about a similar issue, where sometimes we want
> > > > > > > > just the schema, but the service can't necessarily return the
> > > > > > > > schema without fetching data - right now we return a sentinel
> > > > > > > > value in GetFlightInfo, but a separate RPC would let us explicitly
> > > > > > > > indicate an error.
> > > > > > > >
> > > > > > > > I might be missing something though - what happens between step 1
> > > > > > > > and 2 that makes the endpoints available? Would it make sense to
> > > > > > > > use DoAction to cause the backend to "prepare" the endpoints, and
> > > > > > > > have the result of that be an encoded schema? So then the flow
> > > > > > > > would be DoAction -> GetFlightInfo -> DoGet.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > David
> > > > > > > >
> > > > > > > > On 7/1/19, Wes McKinney <wesmck...@gmail.com> wrote:
> > > > > > > >> My inclination is either #2 or #3. #4 is an option of course, but
> > > > > > > >> I like the more structured solution of explicitly requesting the
> > > > > > > >> schema given a descriptor.
> > > > > > > >>
> > > > > > > >> In both cases, it's possible that schemas are sent twice, e.g. if
> > > > > > > >> you call GetSchema and then later call GetFlightInfo and so you
> > > > > > > >> receive the schema again. The schema is optional, so if it became
> > > > > > > >> a performance problem then a particular server might return the
> > > > > > > >> schema as null from GetFlightInfo.
> > > > > > > >>
> > > > > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > > > > > > >> request that returns _both_ the schema and the query plan.
> > > > > > > >>
> > > > > > > >> Thoughts from others?
> > > > > > > >>
> > > > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <jacq...@apache.org> wrote:
> > > > > > > >>>
> > > > > > > >>> My initial inclination is towards #3 but I'd be curious what
> > > > > > > >>> others think. In the case of #3, I wonder if it makes sense to
> > > > > > > >>> then pull the Schema off the GetFlightInfo response...
> > > > > > > >>>
> > > > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <rym...@dremio.com> wrote:
> > > > > > > >>>
> > > > > > > >>>> Hi All,
> > > > > > > >>>>
> > > > > > > >>>> I have been working on building an arrow flight source for spark.
> > > > > > > >>>> The goal here is for Spark to be able to use a group of arrow
> > > > > > > >>>> flight endpoints to get a dataset pulled over to spark in
> > > > > > > >>>> parallel.
> > > > > > > >>>>
> > > > > > > >>>> I am unsure of the best model for the spark <-> flight
> > > > > > > >>>> conversation and wanted to get your opinion on the best way to go.
> > > > > > > >>>>
> > > > > > > >>>> I am breaking up the query to flight from spark into 3 parts:
> > > > > > > >>>> 1) get the schema using GetFlightInfo. This is needed to do
> > > > > > > >>>> further lazy operations in Spark
> > > > > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a
> > > > > > > >>>> different argument. This returns the list of endpoints on the
> > > > > > > >>>> parallel flight server. The endpoints are not available till data
> > > > > > > >>>> is ready to be fetched, which is done after the schema but is
> > > > > > > >>>> needed before DoGet is called.
> > > > > > > >>>> 3) call get stream on all endpoints from 2
> > > > > > > >>>>
> > > > > > > >>>> I think I have to do each step however I don't like having to call
> > > > > > > >>>> getInfo twice, it doesn't seem very elegant.
> > > > > > > >>>> I see a few options:
> > > > > > > >>>> 1) live with calling GetFlightInfo twice and with a custom bytes
> > > > > > > >>>> cmd to differentiate the purpose of each call
> > > > > > > >>>> 2) add an argument to GetFlightInfo to tell it it's being called
> > > > > > > >>>> only for the schema
> > > > > > > >>>> 3) add another rpc endpoint: i.e. GetSchema(FlightDescriptor) to
> > > > > > > >>>> return just the Schema in question
> > > > > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
> > > > > > > >>>>
> > > > > > > >>>> I am aware that 4 is probably the least disruptive but I'm also
> > > > > > > >>>> not a fan as (to me) it implies performing an action on the server
> > > > > > > >>>> side. Suggestions 2 & 3 are larger changes and I am reluctant to
> > > > > > > >>>> do that unless there is a consensus here. None of them are great
> > > > > > > >>>> options and I am wondering what everyone thinks the best approach
> > > > > > > >>>> might be? Particularly as I think this is likely to come up in
> > > > > > > >>>> more applications than just spark.
> > > > > > > >>>>
> > > > > > > >>>> Best,
> > > > > > > >>>> Ryan
> > > > >
> > > > > --
> > > > >
> > > > > Ryan Murray | Principal Consulting Engineer
> > > > > +447540852009 | rym...@dremio.com
> > > > > <https://www.dremio.com/>
> > > > > Check out our GitHub <https://www.github.com/dremio>, join our
> > > > > community site <https://community.dremio.com/> & Download Dremio
> > > > > <https://www.dremio.com/download>
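
For anyone following along, here is a rough sketch of the flow that option #3 seems to imply: GetSchema first, then GetFlightInfo for the endpoints, then DoGet on each endpoint in parallel. It is a toy in-memory model in plain Python, not the Arrow Flight API. The names FlightService, DemoService, Endpoint, FlightInfo and read_dataset are all made up for illustration, and the schema is just a tuple of column names. The default get_schema follows Antoine's suggestion of relying on GetFlightInfo, and it raises an error when no schema is available, which is the explicit failure David mentioned wanting instead of a sentinel value.

```python
# Hypothetical in-memory model of the flow discussed in this thread.
# None of these classes are the real Arrow Flight API; they only mirror
# the RPC names (GetSchema, GetFlightInfo, DoGet) used above.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Schema = Tuple[str, ...]   # stand-in for an Arrow schema: column names only
Row = Tuple[int, str]


@dataclass(frozen=True)
class Endpoint:
    ticket: str            # opaque handle passed to do_get


@dataclass
class FlightInfo:
    schema: Optional[Schema]   # optional, as noted in the thread
    endpoints: List[Endpoint]


class FlightService:
    """Base service: get_schema defaults to delegating to get_flight_info."""

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        raise NotImplementedError

    def get_schema(self, descriptor: str) -> Schema:
        # Default for option #3: reuse GetFlightInfo and keep only the schema.
        info = self.get_flight_info(descriptor)
        if info.schema is None:
            # A dedicated RPC can fail explicitly instead of using a sentinel.
            raise LookupError(f"schema unavailable for {descriptor!r}")
        return info.schema

    def do_get(self, ticket: str) -> List[Row]:
        raise NotImplementedError


class DemoService(FlightService):
    """Toy server holding two partitions of one dataset."""

    def __init__(self) -> None:
        self._parts: Dict[str, List[Row]] = {
            "part-0": [(1, "a"), (2, "b")],
            "part-1": [(3, "c")],
        }

    def get_flight_info(self, descriptor: str) -> FlightInfo:
        endpoints = [Endpoint(t) for t in sorted(self._parts)]
        return FlightInfo(schema=("id", "name"), endpoints=endpoints)

    def do_get(self, ticket: str) -> List[Row]:
        return self._parts[ticket]


def read_dataset(service: FlightService, descriptor: str):
    schema = service.get_schema(descriptor)        # step 1: schema only
    info = service.get_flight_info(descriptor)     # step 2: endpoints
    with ThreadPoolExecutor() as pool:             # step 3: DoGet in parallel
        parts = list(pool.map(lambda e: service.do_get(e.ticket),
                              info.endpoints))
    rows = [row for part in parts for row in part]
    return schema, rows


schema, rows = read_dataset(DemoService(), "demo-query")
print(schema, rows)
```

A server that can produce the schema cheaply would override get_schema directly, so the default only matters for services that have nothing better than GetFlightInfo. Note that this sketch still receives the schema twice (once per call), which is the duplication Wes pointed out; a server could return a null schema from GetFlightInfo if that became a cost.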