Re: Spark and Arrow Flight

Wes McKinney Wed, 10 Jul 2019 10:46:02 -0700

Of course, it might make just as much sense in Apache Spark. Probably
worth bringing up with that community, too.


On Wed, Jul 10, 2019 at 12:37 PM Wes McKinney <[email protected]> wrote:
>
> hi Ryan -- I was thinking that this might be built separately from the
> main Java project. We don't have a model in the codebase yet for
> libraries that depend on the core libraries (this could be in an apps/
> directory at the top level, so apps/spark-flight-source or something).
> So the development procedure would be to build and install the Arrow
> libraries first and then build the Spark-Flight source as a follow up.
>
> I think there would be a lot of benefit to maintaining common
> development infrastructure -- for example, we could set up
> docker-compose tasks to spin up nodes to simulate a distributed system
> for testing and benchmarking purposes, and utilize common CI systems.
>
> - Wes
>
> On Wed, Jul 10, 2019 at 12:28 PM Ryan Murray <[email protected]> wrote:
> >
> > Hey Wes,
> >
> > Would be happy to! Jacques and I had originally thought to try and get it
> > into Spark but perhaps Arrow might be a better home. I think the only issue
> > is whether we want to bring Spark jars and their dependencies into Arrow.
> > One challenge I have had so far with the connector is managing the
> > transitive Arrow dependencies from Spark: the connector only works on
> > relatively recent versions of Spark and can potentially create circular
> > Arrow dependencies. I think this issue will be better once 1.0.0 is done
> > and we can rely on a stable format/API.
> >
> > Best,
> > Ryan
> >
> > On Tue, Jul 9, 2019 at 5:08 PM Wes McKinney <[email protected]> wrote:
> >
> > > Hi Ryan, have you thought about developing this inside Apache Arrow?
> > >
> > > On Tue, Jul 9, 2019, 5:42 PM Bryan Cutler <[email protected]> wrote:
> > >
> > > > Great, thanks Ryan! I'll take a look
> > > >
> > > > On Tue, Jul 9, 2019 at 3:31 PM Ryan Murray <[email protected]> wrote:
> > > >
> > > > > Hi Bryan,
> > > > >
> > > > > I have an implementation of option #3 nearly ready for a PR. I will
> > > > > mention you when I publish it.
> > > > >
> > > > > The working prototype for the Spark connector is here:
> > > > > https://github.com/rymurr/flight-spark-source
> > > > > It technically works (and is very fast!); however, the implementation
> > > > > is pretty dodgy and needs to be cleaned up before it is ready for prime
> > > > > time. I plan to have it ready to go for the Arrow 1.0.0 release as an
> > > > > Apache 2.0 licensed project. Please shout if you have any comments or
> > > > > are interested in contributing!
> > > > >
> > > > > Best,
> > > > > Ryan
> > > > >
> > > > > On Tue, Jul 9, 2019 at 3:21 PM Bryan Cutler <[email protected]> wrote:
> > > > >
> > > > > > I'm in favor of option #3 also, but I'm not sure what the best thing to
> > > > > > do with the existing FlightInfo response is. I'm definitely interested
> > > > > > in connecting Spark with Flight. Can you share more details of your
> > > > > > work, or is it planned to be open sourced?
> > > > > >
> > > > > > Thanks,
> > > > > > Bryan
> > > > > >
> > > > > > On Tue, Jul 2, 2019 at 3:35 AM Antoine Pitrou <[email protected]> wrote:
> > > > > >
> > > > > > >
> > > > > > > Either #3 or #4 for me.  If #3, the default GetSchema implementation
> > > > > > > can rely on calling GetFlightInfo.
> > > > > > >
> > > > > > >
> > > > > > > On 01/07/2019 at 22:50, David Li wrote:
> > > > > > > > I think I'd prefer #3 over overloading an existing call (#2).
> > > > > > > >
> > > > > > > > We've been thinking about a similar issue, where sometimes we want
> > > > > > > > just the schema, but the service can't necessarily return the schema
> > > > > > > > without fetching data - right now we return a sentinel value in
> > > > > > > > GetFlightInfo, but a separate RPC would let us explicitly indicate
> > > > > > > > an error.
> > > > > > > >
> > > > > > > > I might be missing something though - what happens between step 1
> > > > > > > > and 2 that makes the endpoints available? Would it make sense to use
> > > > > > > > DoAction to cause the backend to "prepare" the endpoints, and have
> > > > > > > > the result of that be an encoded schema? So then the flow would be
> > > > > > > > DoAction -> GetFlightInfo -> DoGet.
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > David
> > > > > > > >
> > > > > > > > On 7/1/19, Wes McKinney <[email protected]> wrote:
> > > > > > > >> My inclination is either #2 or #3. #4 is an option of course, but I
> > > > > > > >> like the more structured solution of explicitly requesting the
> > > > > > > >> schema given a descriptor.
> > > > > > > >>
> > > > > > > >> In both cases, it's possible that schemas are sent twice, e.g. if
> > > > > > > >> you call GetSchema and then later call GetFlightInfo and so you
> > > > > > > >> receive the schema again. The schema is optional, so if it became a
> > > > > > > >> performance problem then a particular server might return the
> > > > > > > >> schema as null from GetFlightInfo.
> > > > > > > >>
> > > > > > > >> I think it's valid to want to make a single GetFlightInfo RPC
> > > > > > > >> request that returns _both_ the schema and the query plan.
> > > > > > > >>
> > > > > > > >> Thoughts from others?
> > > > > > > >>
> > > > > > > >> On Fri, Jun 28, 2019 at 8:52 PM Jacques Nadeau <[email protected]> wrote:
> > > > > > > >>>
> > > > > > > >>> My initial inclination is towards #3 but I'd be curious what others
> > > > > > > >>> think. In the case of #3, I wonder if it makes sense to then pull
> > > > > > > >>> the Schema off the GetFlightInfo response...
> > > > > > > >>>
> > > > > > > >>> On Fri, Jun 28, 2019 at 10:57 AM Ryan Murray <[email protected]> wrote:
> > > > > > > >>>
> > > > > > > >>>> Hi All,
> > > > > > > >>>>
> > > > > > > >>>> I have been working on building an arrow flight source for spark.
> > > > > > > >>>> The goal here is for Spark to be able to use a group of arrow
> > > > > > > >>>> flight endpoints to get a dataset pulled over to spark in parallel.
> > > > > > > >>>>
> > > > > > > >>>> I am unsure of the best model for the spark <-> flight conversation
> > > > > > > >>>> and wanted to get your opinion on the best way to go.
> > > > > > > >>>>
> > > > > > > >>>> I am breaking up the query to flight from spark into 3 parts:
> > > > > > > >>>> 1) get the schema using GetFlightInfo. This is needed to do further
> > > > > > > >>>> lazy operations in Spark
> > > > > > > >>>> 2) get the endpoints by calling GetFlightInfo a 2nd time with a
> > > > > > > >>>> different argument. This returns the list of endpoints on the
> > > > > > > >>>> parallel flight server. The endpoints are not available till data
> > > > > > > >>>> is ready to be fetched, which is done after the schema but is
> > > > > > > >>>> needed before DoGet is called.
> > > > > > > >>>> 3) call get stream (DoGet) on all endpoints from 2
> > > > > > > >>>>
> > > > > > > >>>> I think I have to do each step; however, I don't like having to
> > > > > > > >>>> call getInfo twice, as it doesn't seem very elegant. I see a few
> > > > > > > >>>> options:
> > > > > > > >>>> 1) live with calling GetFlightInfo twice and with a custom bytes
> > > > > > > >>>> cmd to differentiate the purpose of each call
> > > > > > > >>>> 2) add an argument to GetFlightInfo to tell it it's being called
> > > > > > > >>>> only for the schema
> > > > > > > >>>> 3) add another rpc endpoint, i.e. GetSchema(FlightDescriptor), to
> > > > > > > >>>> return just the Schema in question
> > > > > > > >>>> 4) use DoAction and wrap the expected FlightInfo in a Result
> > > > > > > >>>>
> > > > > > > >>>> I am aware that 4 is probably the least disruptive but I'm also not
> > > > > > > >>>> a fan as (to me) it implies performing an action on the server
> > > > > > > >>>> side. Suggestions 2 & 3 are larger changes and I am reluctant to do
> > > > > > > >>>> that unless there is a consensus here. None of them are great
> > > > > > > >>>> options and I am wondering what everyone thinks the best approach
> > > > > > > >>>> might be, particularly as I think this is likely to come up in more
> > > > > > > >>>> applications than just spark.
> > > > > > > >>>>
> > > > > > > >>>> Best,
> > > > > > > >>>> Ryan
> > > > > > > >>>>
> > > > > > > >>
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > >
> > > > > Ryan Murray  | Principal Consulting Engineer
> > > > >
> > > > > +447540852009 | [email protected]
> > > > >
> > > > > <https://www.dremio.com/>
> > > > > Check out our GitHub <https://www.github.com/dremio>, join our
> > > community
> > > > > site <https://community.dremio.com/> & Download Dremio
> > > > > <https://www.dremio.com/download>
> > > > >
> > > >
> > >
> >
> >
> > --
> >
> > Ryan Murray  | Principal Consulting Engineer
> >
> > +447540852009 | [email protected]
> >
> > <https://www.dremio.com/>
> > Check out our GitHub <https://www.github.com/dremio>, join our community
> > site <https://community.dremio.com/> & Download Dremio
> > <https://www.dremio.com/download>
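
For readers following the thread, the flow under option #3 can be
sketched roughly as below. This is a toy in-memory model written only
for illustration, not the Arrow Flight API: ToyFlightServer, the
FlightInfo dataclass as defined here, read_dataset, and the use of a
tuple of column names as the "schema" are all hypothetical. It assumes
a GetSchema call whose default implementation falls back on
GetFlightInfo (as suggested in the thread), followed by GetFlightInfo
for the endpoints and one DoGet per endpoint in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FlightInfo:
    # Hypothetical stand-in: the schema is optional, as noted in the thread.
    schema: Optional[tuple]
    endpoints: tuple  # opaque tickets, one per partition


class ToyFlightServer:
    """In-memory stand-in for a Flight service (not the real Arrow API)."""

    def __init__(self, schema, partitions):
        self._schema = tuple(schema)
        self._partitions = dict(partitions)  # ticket -> list of rows

    def get_flight_info(self, descriptor):
        # Returns both the schema and the list of endpoints.
        return FlightInfo(self._schema, tuple(self._partitions))

    def get_schema(self, descriptor):
        # Assumed default: derive the schema from GetFlightInfo.
        info = self.get_flight_info(descriptor)
        if info.schema is None:
            # A separate RPC can report an explicit error instead of
            # returning a sentinel value.
            raise LookupError("schema not available for %r" % (descriptor,))
        return info.schema

    def do_get(self, ticket):
        return list(self._partitions[ticket])


def read_dataset(server, descriptor):
    schema = server.get_schema(descriptor)        # 1) schema only
    info = server.get_flight_info(descriptor)     # 2) endpoints
    with ThreadPoolExecutor() as pool:            # 3) DoGet on each endpoint
        parts = list(pool.map(server.do_get, info.endpoints))
    rows = [row for part in parts for row in part]
    return schema, rows
```

A server that cannot produce a schema cheaply would override
get_schema in this sketch rather than rely on the fallback; how the
real service should behave in that case is the question left open in
the thread.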
