Re: [DISCUSS] Restarting the Arrow Conversation

Ted Dunning Mon, 03 Jan 2022 12:54:02 -0800

Christian,

Your thoughts are very helpful. I find Arrow very nice (I use it in Agstack
with Julia and Python).


I don't think anybody is saying that Drill wouldn't be well served by a
switch to Arrow or even just interfaces to Arrow. But it is a lot of work
to make it all happen.



On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix <[email protected]> wrote:

> Hi Charles, Ted, and the others here,
>
> It is very interesting to hear about the evolution of Drill, Dremio and
> Arrow in that context, and thank you, Charles, for restarting that discussion.
>
> I think, and James mentioned this in the PR as well, that Drill could
> benefit from the continuous progress the Arrow project has made since its
> separation from Drill. The Arrow community seems to be large, so I
> assume this will go on and on with improvements, new features, etc., but I
> do not have enough experience with Drill internals to have an idea of how
> much refactoring this would require.
>
> In addition to that, I'm not aware of the current roadmap of Arrow or
> whether it would fit into Drill's roadmap. Maybe Arrow would go in a different
> direction than Drill, and what should we do if Drill is bound to Arrow then?
>
> On the other hand, Arrow could help Drill reach wider adoption through
> clients like pyarrow, arrow-flight, various other programming languages,
> etc., and (I'm not sure about this) maybe there is a performance benefit if
> Drill uses Arrow to read data from HDFS (for example), uses Arrow to work
> with it during execution, and hands the vectors directly to my Python (for
> example) program via arrow-flight so that I can play around with Pandas, etc.
>
> Just some thoughts I have had since I have used Dremio with pyarrow and
> Drill with ODBC connections.
>
> Regards
> Christian
> -------- Original Message --------
> On Jan 3, 2022, 20:08, Charles Givre wrote:
>
>
> Thanks Ted for the perspective! I had always wished to be a "fly on the
> wall" in those conversations. :-)
> -- C
>
> > On Jan 3, 2022, at 11:00 AM, Charles Givre <[email protected]> wrote:
> >
> > Hello all,
> > There was a discussion in a recently closed PR [1] between
> z0ltrix, James Turton and a few others about integrating Drill with
> Apache Arrow and wondering why it was never done. I'd like to share my
> perspective as someone who has been around Drill for some time but also as
> someone who never worked for MapR or Dremio. This just represents my
> understanding of events as an outsider, and I could be wrong about some or
> all of this. Please forgive (or correct) any inaccuracies.
> >
> > When I first learned of Arrow and the idea of integrating Arrow with
> Drill, the thing that interested me the most was the ability to move data
> between platforms without having to serialize/deserialize the data. From my
> understanding, MapR did some research and didn't find a significant
> performance advantage and hence didn't really pursue the integration. The
> other side of it was that it would require a significant amount of work to
> refactor major parts of Drill.
> >
> > I don't know the internal politics, but this was one of the major points
> of divergence between Dremio and Drill.
> >
> > With that said, there was a renewed discussion on the list [2] where
> Paul Rogers proposed what he described as a "Crude but Effective" approach
> to an Arrow integration.
> >
> > This is in the email link but here was a part of Paul's email:
> >
> >> Charles, just brainstorming a bit, I think the easiest way to start is
> to create a simple, stand-alone server that speaks Arrow to the client, and
> uses the native Drill client to speak to Drill. The native Drill client
> exposes Drill value vectors. One trick would be to convert Drill vectors to
> the Arrow format. I think that data vectors are the same format. Possibly
> offset vectors. I think Arrow went its own way with null-value (Drill's
> is-set) vectors. So, some conversion might be a no-op, others might need to
> rewrite a vector. Good thing, this is purely at the vector level, so would
> be easy to write. The next issue is the one that Parth has long pointed
> out: Drill and Arrow each have their own memory allocators. How could we
> share a data vector between the two? The simplest initial solution is just
> to copy the data from Drill to Arrow. Slow, but transparent to the client.
> >>
> >> A crude first-approximation of the development steps:
> >> 1. Create the client shell server.
> >> 2. Implement the Arrow client protocol. Need some way to accept a query
> and return batches of results.
> >> 3. Forward the query to Drill using the native Drill client.
> >> 4. As a first pass, copy vectors from Drill to Arrow and return them to
> the client.
> >> 5. Then, solve that memory allocator problem to pass data without
> copying.
> >
> > One point that Paul made was that these pieces are fairly discrete and
> could be implemented without refactoring major components of Drill. Of
> course, this could be something for Drill 2.0. At a minimum, could we take
> the conversation off of the PR and put it in the email list? ;-)
> >
> > Let's discuss... All ideas are welcome!
> >
> > Best,
> > -- C
> >
> >
> > [1]: https://github.com/apache/drill/pull/2412
> > [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
> >
> >
> >
>
>
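
As a rough illustration of the vector-level conversion Paul describes in the
quoted message, here is a minimal sketch in Python of repacking a null-value
vector. It assumes Drill's is-set vector stores one byte per value (nonzero
meaning the value is set) and targets Arrow's validity bitmap, which stores
one bit per value in least-significant-bit order with 1 meaning valid. The
function name is hypothetical and not part of either project's API, and real
code would operate on each project's own buffers rather than Python bytes.

```python
def is_set_bytes_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value flags into an LSB-first validity bitmap.

    Assumption: each input byte is a Drill-style is-set flag
    (nonzero = value present, zero = null).
    """
    # One output bit per value, rounded up to a whole number of bytes.
    bitmap = bytearray((len(is_set) + 7) // 8)
    for i, flag in enumerate(is_set):
        if flag:
            # Value i lives in byte i // 8, at bit position i % 8.
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap: bytes, i: int) -> bool:
    """Read back the validity bit for value i."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


flags = bytes([1, 0, 1, 1, 0, 0, 0, 1, 1])  # nine values, four of them null
packed = is_set_bytes_to_validity_bitmap(flags)
print(packed.hex())
```

This is the "rewrite a vector" case: the null information has to be
repacked, while a data vector with an identical layout could in principle
be passed through unchanged. The sketch does not pad the output buffer,
which a real Arrow producer would normally do.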
