[DISCUSS] Restarting the Arrow Conversation

Charles Givre Mon, 03 Jan 2022 08:00:40 -0800

Hello all, 
A recently closed PR [1] included a discussion between z0ltrix, James Turton 
and a few others about integrating Drill with Apache Arrow, and about why it 
was never done.  I'd like to share my perspective as someone who has been 
around Drill for some time but who never worked for MapR or Dremio.  This just 
represents my understanding of events as an outsider, and I could be wrong 
about some or all of it.  Please forgive (or correct) any inaccuracies.


When I first learned of Arrow and the idea of integrating Arrow with Drill, the 
thing that interested me the most was the ability to move data between 
platforms without having to serialize/deserialize the data.  From my 
understanding, MapR did some research and didn't find a significant performance 
advantage and hence didn't really pursue the integration.  The other side of it 
was that it would require a significant amount of work to refactor major parts 
of Drill. 

I don't know the internal politics, but this was one of the major points of 
divergence between Dremio and Drill.

With that said, there was a renewed discussion on the list [2] where Paul 
Rogers proposed what he described as a "Crude but Effective" approach to an 
Arrow integration.  

The full message is at the link, but here is part of Paul's email:

> Charles, just brainstorming a bit, I think the easiest way to start is to 
> create a simple, stand-alone server that speaks Arrow to the client, and uses 
> the native Drill client to speak to Drill. The native Drill client exposes 
> Drill value vectors. One trick would be to convert Drill vectors to the Arrow 
> format. I think that data vectors are the same format. Possibly offset 
> vectors. I think Arrow went its own way with null-value (Drill's is-set) 
> vectors. So, some conversion might be a no-op, others might need to rewrite a 
> vector. Good thing, this is purely at the vector level, so would be easy to 
> write. The next issue is the one that Parth has long pointed out: Drill and 
> Arrow each have their own memory allocators. How could we share a data vector 
> between the two? The simplest initial solution is just to copy the data from 
> Drill to Arrow. Slow, but transparent to the client.
> 
> A crude first-approximation of the development steps: 
> 1. Create the client shell server. 
> 2. Implement the Arrow client protocol. Need some way to accept a query and 
> return batches of results. 
> 3. Forward the query to Drill using the native Drill client. 
> 4. As a first pass, copy vectors from Drill to Arrow and return them to the 
> client. 
> 5. Then, solve that memory allocator problem to pass data without copying.
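
To make the "others might need to rewrite a vector" point concrete, here is a 
minimal sketch of the null-vector conversion. It assumes Drill's is-set vector 
holds one byte per value (nonzero means the value is present) and that Arrow's 
validity bitmap packs one bit per value, least-significant bit first. The 
function name is hypothetical and this is plain Python over byte strings, not 
either project's API.

```python
def is_set_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value 'is-set' flags into a bit-packed validity bitmap.

    Assumption: input uses one byte per value, nonzero = value present.
    Output uses one bit per value, least-significant bit first, 1 = valid.
    """
    # One output byte covers eight values; round up for a partial last byte.
    bitmap = bytearray((len(is_set) + 7) // 8)
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


# Nine values: positions 0, 2, 3, 7 and 8 are set, the rest are null.
flags = bytes([1, 0, 1, 1, 0, 0, 0, 1, 1])
packed = is_set_to_validity_bitmap(flags)
print(packed.hex())  # prints "8d01"
```

This is the "copy" flavour of step 4: every batch pays for a pass over the 
is-set bytes, which is slow but keeps the two memory allocators independent.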

One point Paul made was that these pieces are fairly discrete and could be 
implemented without refactoring major components of Drill.  Of course, this 
could be something for Drill 2.0.  At a minimum, could we move the conversation 
off of the PR and onto the mailing list? ;-)

Let's discuss... All ideas are welcome!

Best,
-- C


[1]: https://github.com/apache/drill/pull/2412
[2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
