Hi Andy & Charles,

We've discussed two ways for Drill to interface with Arrow: as an input or as
an output:

Arrow Producer --> Drill --> Arrow Consumer

Given how Drill works, the easier of the two is to create a storage plugin to
read from an Arrow Producer, perhaps using Arrow Flight (thanks Andy!). We
would need an Arrow producer (perhaps in Beta form?). Charles, since you are 
our data science expert, do you know of any good example use case here? Andy, 
any early adopters of Flight we could use to test an Arrow Flight storage 
plugin?

We could also send data to an Arrow Consumer, but this is more work. The 
crudest prototype would be to create a new stand-alone server that speaks to 
Drill using the Drill client and to an Arrow consumer using the Flight 
protocol. The prototype server would do any required Drill-to-Arrow data 
conversion. (This is not entirely crazy; it is the approach I took way back 
when with the Jig prototype.) The hardest part might be the query: the Arrow 
consumer must first send Drill a SQL query before Drill can return 
Arrow-formatted results.

Andy and Charles, do you know of any good use cases which are not served 
equally well with the JDBC, ODBC or REST APIs?

Either or both of these would be a low-cost way to determine if any use cases 
exist for an Arrow-native data interchange ability. If we find interest, we 
could then decide to pull either the Arrow Flight storage plugin (consumer) or 
Arrow Flight server (producer) into the Drill code base.

And, of course, we can do these independently of our investigations about using 
Arrow as our internal representation.

Thanks,

- Paul

 

On Monday, January 13, 2020, 11:45:29 AM PST, Andy Grove 
<andygrov...@gmail.com> wrote:

I just started working with Drill and I am a PMC member of Apache Arrow. I
am in the process of writing my first storage plugin for Drill, and I think
it would be interesting to build a storage plugin for the Apache Arrow
Flight protocol as a way for Drill to query Arrow data, although I'm not
sure what compelling use cases there are for something like this.

https://arrow.apache.org/blog/2019/10/13/introducing-arrow-flight/



On Sun, Jan 12, 2020 at 7:15 PM Charles Givre <cgi...@gmail.com> wrote:

> Hello All,
> Glad to see this discussion happening!  I apologize for the long email.
>
> I thought I would contribute my $0.02 here, which is more strategic than
> technical. When I first heard about Arrow a few years ago, I was very
> excited about the promise it had for enabling data interchange. From what I
> understood at the time, the main advantage of Arrow was that a user could
> query data in platform X, then use that same data in memory in platform Y
> without having to serialize/deserialize it.
>
> From a data science perspective, this interoperability would be a huge
> win.  Consider the case of building a machine learning model from
> complicated data.  Let's say you are trying to do this from data that
> happens to be stored in MongoDB and CSV files.  Using conventional methods,
> a data scientist would probably write python code to get the data from
> Mongo and get it into a pandas dataframe.  Then they would do the same for
> the CSV and join these data sets together, hoping that they don't blow up
> their machine in the process.
>
> Once this data has been processed, they'd do a bunch of feature
> engineering and other cleanup before the data gets piped into a machine
> learning module like Tensorflow or Scikit-Learn for training.  Once the
> model has been trained, they would then have to create some script to
> recreate that pipeline to make predictions on unknown data.  This is
> simplified a bit but you get the idea.
>
> At the time, I was imagining that a user could use Drill to do all the
> data prep (i.e., joining the CSV with the Mongo data and associated cleanup)
> and then pass that to python to train the model.  Likewise, the inverse
> could happen: once the model is trained, it could be exported to some
> degree such that its output could be included in queries.
>
> However, in the years since Arrow's debut, I've honestly not seen the
> value of it.  Furthermore, John Omernik created a series of really amazing
> integrations for Jupyter Notebook that enable a user to type a query into a
> notebook cell and automatically get the results straight into a pandas
> dataframe.
>
> So, where does this leave us?
> From my perspective, at the moment, I'm not convinced that Arrow offers a
> significant performance advantage over Drill's current data representation.
> (Correct me if I'm wrong here)
> Furthermore, as is evident from this email thread, converting Drill to use
> Arrow would be extremely complicated and likely mean rewriting a majority
> of Drill's codebase.
>
> I am not convinced that the benefit would be worth the effort.  BUT...
>
> (Forgive me here because these questions are a bit out of my technical
> depth on this subject), what if Drill could consume Arrow data FROM
> external sources and/or export data TO external sources?  Something like a
> storage plugin for Arrow.  My reasoning is that if the key benefit of Arrow
> is the data interchange, then is it possible to add that functionality to
> Drill without having to rewrite all of Drill?  I think Paul suggested
> something similar to this approach in a previous thread with the title
> "Crude but Effective Approach".  This approach would capture most of the
> benefit without most of the cost of the conversion. Admittedly, I suspect
> this would not be as performant as a pure Arrow approach, and I think there
> would have to be some experimentation to see how well this could actually
> work in practice.
>
> This is all based on the assumptions that fully integrating Arrow would be
> a major undertaking, that the data exchange is the key benefit and that
> Arrow's performance is comparable to our current implementation. If these
> assumptions are incorrect, please let me know.
> Best,
> -- C

  
