Re: [DISCUSS] Restarting the Arrow Conversation

Paul Rogers Mon, 03 Jan 2022 19:34:20 -0800

Hi Charles,

The material is rather dense and benefits from the GitHub formatting. To
preserve it, perhaps we can copy it to a subpage of the Drill 2.0 wiki page.

For now, the link to the discussion is [1]. Since the wiki is not good for
discussions, let's have that discussion here (if anyone is up to tackling
such a weighty subject).

Thanks,

- Paul

[1] https://github.com/apache/drill/pull/2412

On Mon, Jan 3, 2022 at 5:15 PM Charles Givre <cgi...@gmail.com> wrote:

> @Paul,
> Do you mind if I copy the contents of your response to DRILL-8088 to this
> thread? There's a lot of good info there, and I'd hate to see it get lost.
> -- C
>
> > On Jan 3, 2022, at 7:41 PM, Paul Rogers <par0...@gmail.com> wrote:
> >
> > Hi All,
> >
> > Thanks, Charles, for dredging up that old discussion; your memory is
> > better than mine! And thanks, Ted, for that summary of MapR history. As
> > one of the "replacement crew" brought in after the original folks left,
> > I find your description consistent with my memory of events. Moreover,
> > as we looked at what was needed to run Drill in production, an Arrow
> > port was far down on the list: it would not have solved actual customer
> > problems.
> >
> > Before we get too excited about Arrow, I think we should have a
> > discussion about what we want in an internal storage format. I added a
> > long (sorry) set of comments in the PR that Charles mentioned that tries
> > to debunk the myths that have grown up around using a columnar format as
> > the internal representation for a query engine. (Columnar is great for
> > storage.) The note presents the many issues we've encountered over the
> > years that have caused us to layer ever more code on top of vectors to
> > solve various problems. It also highlights a distributed-systems problem
> > which vectors make far worse.
> >
> > Arrow is meant to be portable, as Ted discussed, but it is still
> > columnar, and this is the source of endless problems in an execution
> > engine. So, we want to ask: what is the optimal format for what Drill
> > actually does? I'm now of the opinion that Drill might actually benefit
> > more from a row-based format, similar to what Impala uses. The notes
> > even paint a path forward.
> >
> > Ted's description of the goal for Dremio suggests that Arrow might be the
> > right answer for that market. Drill, however, tends to be used to query
> > myriad data sources at scale and as a "query integrator" across systems.
> > This use case has different needs, which may be better served with a
> > row-based format.
> >
> > The upshot is that "value vectors vs. Arrow" is the wrong place to start
> > the discussion. The right place is "what do our many years of experience
> > with Drill suggest is the most efficient format for how Drill is
> > actually used?"
> >
> > Note that Drill could have an Arrow-based API independent of the internal
> > format. The quote from Charles explains how we could do that.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Mon, Jan 3, 2022 at 12:54 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> Christian,
> >>
> >> Your thoughts are very helpful. I find Arrow very nice (I use it in
> >> Agstack with Julia and Python).
> >>
> >> I don't think anybody is saying that Drill wouldn't be well set with a
> >> switch to Arrow or even just interfaces to Arrow. But it is a lot of
> >> work to make it all happen.
> >>
> >>
> >>
> >> On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix <z0lt...@pm.me.invalid> wrote:
> >>
> >>> Hi Charles, Ted, and the others here,
> >>>
> >>> It is very interesting to hear about the evolution of Drill, Dremio,
> >>> and Arrow in that context. Thank you, Charles, for restarting the
> >>> discussion.
> >>>
> >>> I think, and James mentioned this in the PR as well, that Drill could
> >>> benefit from the continuous progress the Arrow project has made since
> >>> its separation from Drill. The Arrow community seems to be large, so I
> >>> assume the improvements and new features will keep coming. However, I
> >>> do not have enough experience with Drill internals to judge how much
> >>> refactoring this would require.
> >>>
> >>> In addition, I'm not aware of Arrow's current roadmap or whether it
> >>> would fit with Drill's. Arrow might go in a different direction than
> >>> Drill; what should we do then, if Drill is bound to Arrow?
> >>>
> >>> On the other hand, Arrow could help Drill reach wider adoption through
> >>> clients such as pyarrow, arrow-flight, and various other programming
> >>> languages. There may also be a performance benefit (I'm not sure about
> >>> that) if Drill uses Arrow to read data from, for example, HDFS, works
> >>> with it in Arrow form during execution, and hands the vectors directly
> >>> to my Python program (again, as an example) via arrow-flight so that I
> >>> can play around with Pandas, etc.
> >>>
> >>> Just some thoughts from having used Dremio with pyarrow and Drill with
> >>> ODBC connections.
> >>>
> >>> Regards
> >>> Christian
> >>> -------- Original Message --------
> >>> On Jan 3, 2022, 20:08, Charles Givre wrote:
> >>>
> >>>
> >>> Thanks Ted for the perspective! I had always wished to be a "fly on the
> >>> wall" in those conversations. :-)
> >>> -- C
> >>>
> >>>> On Jan 3, 2022, at 11:00 AM, Charles Givre <cgi...@gmail.com> wrote:
> >>>>
> >>>> Hello all,
> >>>> A recently closed PR [1] included a discussion between z0ltrix, James
> >>>> Turton, and a few others about integrating Drill with Apache Arrow and
> >>>> why it was never done. I'd like to share my perspective as someone
> >>>> who has been around Drill for some time but who never worked for MapR
> >>>> or Dremio. This just represents my understanding of events as an
> >>>> outsider, and I could be wrong about some or all of this. Please
> >>>> forgive (or correct) any inaccuracies.
> >>>>
> >>>> When I first learned of Arrow and the idea of integrating Arrow with
> >>>> Drill, the thing that interested me the most was the ability to move
> >>>> data between platforms without having to serialize/deserialize the
> >>>> data. From my understanding, MapR did some research and didn't find a
> >>>> significant performance advantage and hence didn't really pursue the
> >>>> integration. The other side of it was that it would require a
> >>>> significant amount of work to refactor major parts of Drill.
> >>>>
> >>>> I don't know the internal politics, but this was one of the major
> >>>> points of divergence between Dremio and Drill.
> >>>>
> >>>> With that said, there was a renewed discussion on the list [2] where
> >>>> Paul Rogers proposed what he described as a "Crude but Effective"
> >>>> approach to an Arrow integration.
> >>>>
> >>>> This is in the linked email, but here is part of Paul's message:
> >>>>
> >>>>> Charles, just brainstorming a bit, I think the easiest way to start
> >>>>> is to create a simple, stand-alone server that speaks Arrow to the
> >>>>> client, and uses the native Drill client to speak to Drill. The
> >>>>> native Drill client exposes Drill value vectors. One trick would be
> >>>>> to convert Drill vectors to the Arrow format. I think that data
> >>>>> vectors are the same format. Possibly offset vectors. I think Arrow
> >>>>> went its own way with null-value (Drill's is-set) vectors. So, some
> >>>>> conversion might be a no-op, others might need to rewrite a vector.
> >>>>> Good thing, this is purely at the vector level, so would be easy to
> >>>>> write. The next issue is the one that Parth has long pointed out:
> >>>>> Drill and Arrow each have their own memory allocators. How could we
> >>>>> share a data vector between the two? The simplest initial solution
> >>>>> is just to copy the data from Drill to Arrow. Slow, but transparent
> >>>>> to the client.
> >>>>>
> >>>>> A crude first-approximation of the development steps:
> >>>>> 1. Create the client shell server.
> >>>>> 2. Implement the Arrow client protocol. Need some way to accept a
> >>>>>    query and return batches of results.
> >>>>> 3. Forward the query to Drill using the native Drill client.
> >>>>> 4. As a first pass, copy vectors from Drill to Arrow and return them
> >>>>>    to the client.
> >>>>> 5. Then, solve that memory allocator problem to pass data without
> >>>>>    copying.
> >>>>
> >>>> One point that Paul made was that these pieces are fairly discrete and
> >>>> could be implemented without refactoring major components of Drill. Of
> >>>> course, this could be something for Drill 2.0. At a minimum, could we
> >>>> take the conversation off of the PR and put it on the email list? ;-)
> >>>>
> >>>> Let's discuss... All ideas are welcome!
> >>>>
> >>>> Best,
> >>>> -- C
> >>>>
> >>>>
> >>>> [1]: https://github.com/apache/drill/pull/2412
> >>>> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
> >>>>
> >>>>
> >>>>
> >>>
> >>>
> >>
>
>
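
As an illustration of the vector-level conversion mentioned in the quoted
proposal (Drill's is-set vectors versus Arrow's null representation), here
is a minimal sketch in plain Python. It rests on two assumptions that are
not stated in the thread: that a Drill is-set vector holds one byte per
value (1 = set, 0 = null), and that the Arrow side wants a bit-packed
validity bitmap with the least-significant bit first. The function names
are hypothetical, and no Drill or Arrow library is used, so this shows only
the shape of the rewrite, not an actual integration.

```python
def is_set_bytes_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value flags into a bitmap, one bit per value.

    Assumption: input flags are 1 (value present) or 0 (null), and the
    output uses least-significant-bit-first ordering within each byte.
    """
    # One output byte covers eight input values; round up for the tail.
    bitmap = bytearray((len(is_set) + 7) // 8)
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def count_nulls(is_set: bytes) -> int:
    """Count null entries (flag == 0) in a one-byte-per-value vector."""
    return sum(1 for flag in is_set if not flag)


if __name__ == "__main__":
    flags = bytes([1, 0, 1, 1, 0, 0, 0, 1, 1])  # nine values, four nulls
    print(is_set_bytes_to_validity_bitmap(flags).hex())
    print(count_nulls(flags))
```

Under these assumptions the null vector is a case where the conversion has
to rewrite the buffer rather than hand it over unchanged, which fits the
"copy first, optimize later" ordering of the steps in the proposal.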
