Re: [DISCUSS] Restarting the Arrow Conversation

Charles Givre Mon, 03 Jan 2022 17:15:24 -0800

@Paul,
Do you mind if I copy the contents of your response to DRILL-8088 into this
thread? There's a lot of good info there, and I'd hate to see it get lost.
-- C


> On Jan 3, 2022, at 7:41 PM, Paul Rogers <par0...@gmail.com> wrote:
> 
> Hi All,
> 
> Thanks Charles for dredging up that old discussion; your memory is better
> than mine! And, thanks Ted for that summary of MapR history. As one of the
> "replacement crew" brought in after the original folks left, your
> description is consistent with my memory of events. Moreover, as we looked
> at what was needed to run Drill in production, an Arrow port was far down
> on the list: it would not have solved actual customer problems.
> 
> Before we get too excited about Arrow, I think we should have a discussion
> about what we want in an internal storage format. I added a long (sorry)
> set of comments in that PR that Charles mentioned that tries to debunk the
> myths that have grown up around using a columnar format as the internal
> representation for a query engine. (Columnar is great for storage.) The
> note presents the many issues we've encountered over the years that have
> caused us to layer ever more code on top of vectors to solve various
> problems. It also highlights a distributed-systems problem which vectors
> make far worse.
> 
> Arrow is meant to be portable, as Ted discussed, but it is still columnar,
> and this is the source of endless problems in an execution engine. So, we
> want to ask, what is the optimal format for what Drill actually does? I'm
> now of the opinion that Drill might actually benefit more from a
> row-based format, similar to what Impala uses. The notes even paint a path
> forward.
> 
> Ted's description of the goal for Dremio suggests that Arrow might be the
> right answer for that market. Drill, however, tends to be used to query
> myriad data sources at scale and as a "query integrator" across systems.
> This use case has different needs, which may be better served with a
> row-based format.
> 
> The upshot is that "value vectors vs. Arrow" is the wrong place to start
> the discussion. The right place is "what do our many years of experience
> with Drill suggest is the most efficient format for how Drill is actually
> used?"
> 
> Note that Drill could have an Arrow-based API independent of the internal
> format. The quote from Charles explains how we could do that.
> 
> Thanks,
> 
> - Paul
> 
> On Mon, Jan 3, 2022 at 12:54 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> 
>> Christian,
>> 
>> Your thoughts are very helpful. I find Arrow very nice (I use it in Agstack
>> with Julia and Python).
>> 
>> I don't think anybody is saying that Drill wouldn't be well served by a
>> switch to Arrow or even just interfaces to Arrow. But it is a lot of work
>> to make it all happen.
>> 
>> 
>> 
>> On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix <z0lt...@pm.me.invalid> wrote:
>> 
>>> Hi Charles, Ted, and the others here,
>>> 
>>> It is very interesting to hear about the evolution of Drill, Dremio, and
>>> Arrow in that context. Thank you, Charles, for restarting the discussion.
>>> 
>>> I think, and James mentioned this in the PR as well, that Drill could
>>> benefit from the continuous progress the Arrow project has made since its
>>> separation from Drill. The Arrow community seems to be large, so I assume
>>> the improvements and new features will keep coming. However, I do not have
>>> enough experience with Drill internals to know how much refactoring this
>>> would require.
>>> 
>>> In addition, I'm not aware of Arrow's current roadmap or whether it would
>>> fit Drill's roadmap. Arrow might go in a different direction than Drill;
>>> what should we do then, if Drill is bound to Arrow?
>>> 
>>> On the other hand, Arrow could help Drill reach wider adoption through
>>> clients such as pyarrow and arrow-flight, in various programming languages.
>>> It might also be a performance benefit (I'm not sure about that) if Drill
>>> used Arrow to read data from, for example, HDFS, used Arrow to work with it
>>> during execution, and handed the vectors directly to, say, my Python
>>> program via arrow-flight so that I can play around with pandas, etc.
>>> 
>>> Just some thoughts I have had since using Dremio with pyarrow and Drill
>>> with ODBC connections.
>>> 
>>> Regards
>>> Christian
>>> -------- Original Message --------
>>> On Jan 3, 2022, at 20:08, Charles Givre wrote:
>>> 
>>> 
>>> Thanks Ted for the perspective! I had always wished to be a "fly on the
>>> wall" in those conversations. :-)
>>> -- C
>>> 
>>>> On Jan 3, 2022, at 11:00 AM, Charles Givre <cgi...@gmail.com> wrote:
>>>> 
>>>> Hello all,
>>>> There was a discussion in a recently closed PR [1] between z0ltrix, James
>>>> Turton, and a few others about integrating Drill with Apache Arrow and
>>>> why it was never done. I'd like to share my perspective as someone who has
>>>> been around Drill for some time but who never worked for MapR or Dremio.
>>>> This just represents my understanding of events as an outsider, and I
>>>> could be wrong about some or all of this. Please forgive (or correct) any
>>>> inaccuracies.
>>>> 
>>>> When I first learned of Arrow and the idea of integrating Arrow with
>>>> Drill, the thing that interested me the most was the ability to move data
>>>> between platforms without having to serialize/deserialize the data. From
>>>> my understanding, MapR did some research and didn't find a significant
>>>> performance advantage and hence didn't really pursue the integration. The
>>>> other side of it was that it would require a significant amount of work
>>>> to refactor major parts of Drill.
>>>> 
>>>> I don't know the internal politics, but this was one of the major points
>>>> of divergence between Dremio and Drill.
>>>> 
>>>> With that said, there was a renewed discussion on the list [2] where
>>>> Paul Rogers proposed what he described as a "Crude but Effective"
>>>> approach to an Arrow integration.
>>>> 
>>>> This is in the email link but here was a part of Paul's email:
>>>> 
>>>>> Charles, just brainstorming a bit, I think the easiest way to start is
>>>>> to create a simple, stand-alone server that speaks Arrow to the client,
>>>>> and uses the native Drill client to speak to Drill. The native Drill
>>>>> client exposes Drill value vectors. One trick would be to convert Drill
>>>>> vectors to the Arrow format. I think that data vectors are the same
>>>>> format. Possibly offset vectors. I think Arrow went its own way with
>>>>> null-value (Drill's is-set) vectors. So, some conversion might be a
>>>>> no-op, others might need to rewrite a vector. Good thing, this is purely
>>>>> at the vector level, so would be easy to write.
>>>>>
>>>>> The next issue is the one that Parth has long pointed out: Drill and
>>>>> Arrow each have their own memory allocators. How could we share a data
>>>>> vector between the two? The simplest initial solution is just to copy
>>>>> the data from Drill to Arrow. Slow, but transparent to the client.
>>>>> 
>>>>> A crude first-approximation of the development steps:
>>>>> 1. Create the client shell server.
>>>>> 2. Implement the Arrow client protocol. Need some way to accept a query
>>>>>    and return batches of results.
>>>>> 3. Forward the query to Drill using the native Drill client.
>>>>> 4. As a first pass, copy vectors from Drill to Arrow and return them to
>>>>>    the client.
>>>>> 5. Then, solve that memory allocator problem to pass data without
>>>>>    copying.
>>>> 
>>>> One point that Paul made was that these pieces are fairly discrete and
>>>> could be implemented without refactoring major components of Drill. Of
>>>> course, this could be something for Drill 2.0. At a minimum, could we
>>>> take the conversation off the PR and move it to the mailing list? ;-)
>>>> 
>>>> Let's discuss... All ideas are welcome!
>>>> 
>>>> Best,
>>>> -- C
>>>> 
>>>> 
>>>> [1]: https://github.com/apache/drill/pull/2412
>>>> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>
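
To make the vector-level conversion in Paul's quoted brainstorm a little more concrete (Drill's is-set vectors versus Arrow's null handling), here is a minimal sketch in plain Python. It assumes that a Drill is-set vector stores one byte per value (1 = set, 0 = null) and that an Arrow validity bitmap stores one bit per value, least-significant bit first, with 1 meaning valid. The function names are hypothetical and belong to neither project, and the sketch ignores buffer padding, alignment, and the memory-allocator question raised in the thread.

```python
def is_set_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack byte-per-value flags into a bit-per-value bitmap (LSB first).

    Assumption: a non-zero byte means "value present", as in an is-set vector.
    """
    bitmap = bytearray((len(is_set) + 7) // 8)  # one bit per value, rounded up
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)  # set bit i, counting from the LSB
    return bytes(bitmap)


def null_count(is_set: bytes) -> int:
    """Count the entries flagged as null (zero bytes)."""
    return sum(1 for flag in is_set if not flag)


# Nine values: positions 1, 4, 5 and 6 are null.
flags = bytes([1, 0, 1, 1, 0, 0, 0, 1, 1])
bitmap = is_set_to_validity_bitmap(flags)
nulls = null_count(flags)
```

This is the kind of conversion that has to rewrite the vector rather than being a no-op: the flags shrink from one byte per value to one bit per value, so the buffer cannot simply be handed over as-is.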
