@Paul, Do you mind if I copy the contents of your response to DRILL-8088 to this thread? There's a lot of good info there, and I'd hate to see it get lost. -- C
> On Jan 3, 2022, at 7:41 PM, Paul Rogers <par0...@gmail.com> wrote:
>
> Hi All,
>
> Thanks Charles for dredging up that old discussion, your memory is better
> than mine! And, thanks Ted for that summary of MapR history. As one of the
> "replacement crew" brought in after the original folks left, I find your
> description consistent with my memory of events. Moreover, as we looked
> at what was needed to run Drill in production, an Arrow port was far down
> on the list: it would not have solved actual customer problems.
>
> Before we get too excited about Arrow, I think we should have a discussion
> about what we want in an internal storage format. I added a long (sorry)
> set of comments in that PR that Charles mentioned that tries to debunk the
> myths that have grown up around using a columnar format as the internal
> representation for a query engine. (Columnar is great for storage.) The
> note presents the many issues we've encountered over the years that have
> caused us to layer ever more code on top of vectors to solve various
> problems. It also highlights a distributed-systems problem which vectors
> make far worse.
>
> Arrow is meant to be portable, as Ted discussed, but it is still columnar,
> and this is the source of endless problems in an execution engine. So, we
> want to ask, what is the optimal format for what Drill actually does? I'm
> now of the opinion that Drill might actually benefit more from a
> row-based format, similar to what Impala uses. The notes even paint a path
> forward.
>
> Ted's description of the goal for Dremio suggests that Arrow might be the
> right answer for that market. Drill, however, tends to be used to query
> myriad data sources at scale and as a "query integrator" across systems.
> This use case has different needs, which may be better served with a
> row-based format.
>
> The upshot is that "value vectors vs. Arrow" is the wrong place to start
> the discussion.
> The right place is "what does our many years of experience
> with Drill suggest is the most efficient format for how Drill is actually
> used?"
>
> Note that Drill could have an Arrow-based API independent of the internal
> format. The quote from Charles explains how we could do that.
>
> Thanks,
>
> - Paul
>
> On Mon, Jan 3, 2022 at 12:54 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
>
>> Christian,
>>
>> Your thoughts are very helpful. I find Arrow very nice (I use it in
>> Agstack with Julia and Python).
>>
>> I don't think anybody is saying that Drill wouldn't be well set with a
>> switch to Arrow or even just interfaces to Arrow. But it is a lot of work
>> to make it all happen.
>>
>> On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix <z0lt...@pm.me.invalid> wrote:
>>
>>> Hi Charles, Ted, and the others here,
>>>
>>> It is very interesting to hear about the evolution of Drill, Dremio and
>>> Arrow in that context, and thank you, Charles, for restarting that
>>> discussion.
>>>
>>> I think, and James mentioned this in the PR as well, that Drill could
>>> benefit from the continuous progress the Arrow project has made since its
>>> separation from Drill. And the Arrow community seems to be large, so I
>>> assume this goes on and on with improvements, new features, etc., but I
>>> do not have enough experience with Drill internals to have an idea of how
>>> much refactoring this would entail.
>>>
>>> In addition to that, I'm not aware of the current roadmap of Arrow and
>>> whether it would fit into Drill's roadmap. Maybe Arrow would go in a
>>> different direction than Drill, and what should we do if Drill is bound
>>> to Arrow then?
>>>
>>> On the other hand, Arrow could help Drill to a wider adoption with
>>> clients like pyarrow, arrow-flight, various other programming languages,
>>> etc.,
>>> and (I'm not sure about that) maybe it's a performance benefit if Drill
>>> uses Arrow to read data from HDFS (for example), uses Arrow to work with
>>> it during execution, and gives the vectors directly to my Python (for
>>> example) program via arrow-flight so that I can play around with Pandas,
>>> etc.
>>>
>>> Just some thoughts I have since I have used Dremio with pyarrow and Drill
>>> with ODBC connections.
>>>
>>> Regards
>>> Christian
>>>
>>> -------- Original Message --------
>>> On Jan 3, 2022, 20:08, Charles Givre wrote:
>>>
>>> Thanks Ted for the perspective! I had always wished to be a "fly on the
>>> wall" in those conversations. :-)
>>> -- C
>>>
>>>> On Jan 3, 2022, at 11:00 AM, Charles Givre <cgi...@gmail.com> wrote:
>>>>
>>>> Hello all,
>>>> In a recently closed PR [1], z0ltrix, James Turton and a few others
>>>> discussed integrating Drill with Apache Arrow and wondered why it was
>>>> never done. I'd like to share my perspective as someone who has been
>>>> around Drill for some time but also as someone who never worked for MapR
>>>> or Dremio. This just represents my understanding of events as an
>>>> outsider, and I could be wrong about some or all of this. Please forgive
>>>> (or correct) any inaccuracies.
>>>>
>>>> When I first learned of Arrow and the idea of integrating Arrow with
>>>> Drill, the thing that interested me the most was the ability to move
>>>> data between platforms without having to serialize/deserialize the data.
>>>> From my understanding, MapR did some research and didn't find a
>>>> significant performance advantage and hence didn't really pursue the
>>>> integration. The other side of it was that it would require a
>>>> significant amount of work to refactor major parts of Drill.
>>>>
>>>> I don't know the internal politics, but this was one of the major points
>>>> of divergence between Dremio and Drill.
>>>>
>>>> With that said, there was a renewed discussion on the list [2] where
>>>> Paul Rogers proposed what he described as a "Crude but Effective"
>>>> approach to an Arrow integration.
>>>>
>>>> This is in the email link, but here is part of Paul's email:
>>>>
>>>>> Charles, just brainstorming a bit, I think the easiest way to start is
>>>>> to create a simple, stand-alone server that speaks Arrow to the client,
>>>>> and uses the native Drill client to speak to Drill. The native Drill
>>>>> client exposes Drill value vectors. One trick would be to convert Drill
>>>>> vectors to the Arrow format. I think that data vectors are the same
>>>>> format. Possibly offset vectors as well. I think Arrow went its own way
>>>>> with null-value (Drill's is-set) vectors. So, some conversions might be
>>>>> a no-op, others might need to rewrite a vector. The good thing is that
>>>>> this is purely at the vector level, so it would be easy to write.
>>>>>
>>>>> The next issue is the one that Parth has long pointed out: Drill and
>>>>> Arrow each have their own memory allocators. How could we share a data
>>>>> vector between the two? The simplest initial solution is just to copy
>>>>> the data from Drill to Arrow. Slow, but transparent to the client.
>>>>>
>>>>> A crude first-approximation of the development steps:
>>>>> 1. Create the client shell server.
>>>>> 2. Implement the Arrow client protocol. Need some way to accept a
>>>>>    query and return batches of results.
>>>>> 3. Forward the query to Drill using the native Drill client.
>>>>> 4. As a first pass, copy vectors from Drill to Arrow and return them
>>>>>    to the client.
>>>>> 5. Then, solve that memory allocator problem to pass data without
>>>>>    copying.
>>>>
>>>> One point that Paul made was that these pieces are fairly discrete and
>>>> could be implemented without refactoring major components of Drill. Of
>>>> course, this could be something for Drill 2.0.
>>>> At a minimum, could we take the conversation off of the PR and put it
>>>> in the email list? ;-)
>>>>
>>>> Let's discuss... All ideas are welcome!
>>>>
>>>> Best,
>>>> -- C
>>>>
>>>> [1]: https://github.com/apache/drill/pull/2412
>>>> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
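
To make the "some conversions might need to rewrite a vector" point from Paul's quoted note a bit more concrete, here is a minimal sketch of the null-vector case. It assumes Drill's is-set vector stores one byte per value (non-zero meaning "set") and that Arrow's validity bitmap stores one bit per value, least-significant bit first, with 1 meaning "valid". The function name is hypothetical and this is plain Python, not Drill or Arrow code; it only illustrates the kind of rewrite involved.

```python
def is_set_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack a one-byte-per-value is-set vector into a one-bit-per-value bitmap.

    Assumption: a non-zero byte means the value is present. Bits are packed
    least-significant-bit first, and padding bits in the last byte stay 0.
    """
    bitmap = bytearray((len(is_set) + 7) // 8)  # round up to whole bytes
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


# Four values where the second one is null.
packed = is_set_to_validity_bitmap(bytes([1, 0, 1, 1]))
```

Under these assumptions the null vector is one of the places where a copy-and-rewrite is unavoidable, since the two layouts differ in width, whereas a fixed-width data vector could in principle be handed over unchanged.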