Hi Paul,

There is a test flight server in the Arrow Java project [1] that might be a
good starting point, although I haven't used it myself. I was looking at
Arrow Flight for my Ballista PoC [2], although I don't really have time to
spend on that right now.

I'm less sure of the value of having an Arrow consumer for Drill, since
any vectorized processing would already have been performed by Drill. I
may be missing something, though.

Thanks,

Andy.

[1]
https://github.com/apache/arrow/tree/master/java/flight/flight-core#example-usage
[2] https://github.com/andygrove/ballista

On Mon, Jan 13, 2020 at 4:55 PM Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Andy & Charles,
>
> We've discussed two ways for Drill to interface to Arrow: either as an
> input or an output:
>
> Arrow Producer --> Drill --> Arrow Consumer
>
> Given how Drill works, the easier of the two is to create a storage
> plugin to read from an Arrow Producer, perhaps using Arrow Flight (thanks
> Andy!). We would need an Arrow producer (perhaps in Beta form?). Charles,
> since you are our data science expert, do you know of any good example use
> case here? Andy, any early adopters of Flight we could use to test an Arrow
> Flight storage plugin?
>
> We could also send data to an Arrow Consumer, but this is more work. The
> crudest prototype would be to create a new stand-alone server that speaks
> to Drill using the Drill client and to an Arrow consumer using the Flight
> protocol. The prototype server would do any required Drill-to-Arrow data
> conversion. (This is not entirely crazy, it is the approach I took way back
> when with the Jig prototype.) The hardest part might be the query: the
> Arrow consumer must first send Drill a SQL query before Drill can return
> Arrow-formatted results.
>
> Andy and Charles, do you know of any good use cases which are not served
> equally well with the JDBC, ODBC or REST APIs?
>
> Either or both of these would be a low-cost way to determine if any use
> cases exist for an Arrow-native data interchange ability. If we find
> interest, we could then decide to pull either the Arrow Flight storage
> plugin (consumer) or Arrow Flight server (producer) into the Drill code
> base.
>
> And, of course, we can do these independently of our investigations about
> using Arrow as our internal representation.
>
> Thanks,
>
> - Paul
>
>
>
>     On Monday, January 13, 2020, 11:45:29 AM PST, Andy Grove <
> andygrov...@gmail.com> wrote:
>
>  I just started working with Drill and I am a PMC member of Apache Arrow. I
> am in the process of writing my first storage plugin for Drill, and I think
> it would be interesting to build a storage plugin for the Apache Arrow
> Flight protocol as a way for Drill to query Arrow data, although I'm not
> sure what compelling use cases there are for something like this.
>
> https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/
>
>
>
> On Sun, Jan 12, 2020 at 7:15 PM Charles Givre <cgi...@gmail.com> wrote:
>
> > Hello All,
> > Glad to see this discussion happening!  I apologize for the long email.
> >
> > I thought I would contribute my .02 here, which is more strategic than
> > technical. When I first heard about Arrow a few years ago, I was very
> > excited about the promise it had for enabling data interchange. From
> > what I understood at the time, the main advantage of Arrow was that a
> > user could
> > query data in platform X, then use that same data in memory in platform Y
> > without having to serialize/deserialize it.
> >
> > From a data science perspective, this interoperability would be a huge
> > win.  Consider the case of building a machine learning model from
> > complicated data.  Let's say you are trying to do this from data that
> > happens to be stored in MongoDB and CSV files. Using conventional
> > methods, a data scientist would probably write Python code to get the
> > data from Mongo and get it into a pandas dataframe. Then they would do
> > the same for the CSV and join these data sets together, hoping that
> > they don't blow up their machine in the process.
> >
> > Once this data has been processed, they'd do a bunch of feature
> > engineering and other cleanup before the data gets piped into a machine
> > learning module like TensorFlow or scikit-learn for training. Once the
> > model has been trained, they would then have to create some script to
> > recreate that pipeline to make predictions on unknown data. This is
> > simplified a bit, but you get the idea.
> >
> > At the time, I was imagining that a user could use Drill to do all the
> > data prep (i.e., joining the CSV with the Mongo data and associated
> > cleanup) and then pass that to Python to train the model. Likewise, the
> > inverse could happen: once the model is trained, it could be exported
> > to some degree such that its output could be included in queries.
> >
> > However, in the years since Arrow's debut, I've honestly not seen the
> > value of it. Furthermore, John Omernik created a series of really
> > amazing integrations for Jupyter Notebook that enable a user to type a
> > query into a notebook cell and automatically get the results straight
> > into a pandas dataframe.
> >
> > So, where does this leave us?
> > From my perspective, at the moment, I'm not convinced that Arrow
> > offers a significant performance advantage over Drill's current data
> > representation.
> > (Correct me if I'm wrong here)
> > Furthermore, as is evident from this email thread, converting Drill to use
> > Arrow would be extremely complicated and likely mean rewriting a majority
> > of Drill's codebase.
> >
> > I am not convinced that the benefit would be worth the effort.  BUT...
> >
> > (Forgive me here because these questions are a bit out of my technical
> > depth on this subject), but what if Drill could consume Arrow data FROM
> > external sources and/or export data TO external sources? Something like
> > a storage plugin for Arrow. My reasoning is that if the key benefit of
> > Arrow is the data interchange, then is it possible to add that
> > functionality to Drill without having to rewrite all of Drill? I think
> > Paul suggested something similar to this approach in a previous thread
> > with the title "Crude but Effective Approach". This approach would
> > capture most of the benefit without most of the cost of the conversion.
> > Admittedly, I suspect this would not be as performant as a pure Arrow
> > approach, and I think there would have to be some experimentation to
> > see how well this could actually work in practice.
> >
> > This is all based on the assumptions that fully integrating Arrow
> > would be a major undertaking, that the data exchange is the key
> > benefit, and that Arrow's performance is comparable to our current
> > implementation. If these assumptions are incorrect, please let me know.
> > Best,
> > -- C
>
>
