Hi Charles,

The material is rather dense and benefits from the GitHub formatting. To preserve it, perhaps we can copy it to a subpage of the Drill 2.0 wiki page.
For now, the link to the discussion is [1]. Since the wiki is not good for discussions, let's have that discussion here (if anyone is up to tackling such a weighty subject).

Thanks,

- Paul

[1] https://github.com/apache/drill/pull/2412

On Mon, Jan 3, 2022 at 5:15 PM Charles Givre <cgi...@gmail.com> wrote:

> @Paul,
> Do you mind if I copy the contents of your response to DRILL-8088 to this thread? There's a lot of good info there, and I'd hate to see it get lost.
> -- C
>
> > On Jan 3, 2022, at 7:41 PM, Paul Rogers <par0...@gmail.com> wrote:
> >
> > Hi All,
> >
> > Thanks Charles for dredging up that old discussion; your memory is better than mine! And thanks Ted for that summary of MapR history. As one of the "replacement crew" brought in after the original folks left, I can say your description is consistent with my memory of events. Moreover, as we looked at what was needed to run Drill in production, an Arrow port was far down on the list: it would not have solved actual customer problems.
> >
> > Before we get too excited about Arrow, I think we should have a discussion about what we want in an internal storage format. I added a long (sorry) set of comments in the PR that Charles mentioned that tries to debunk the myths that have grown up around using a columnar format as the internal representation for a query engine. (Columnar is great for storage.) The note presents the many issues we've encountered over the years that have caused us to layer ever more code on top of vectors to solve various problems. It also highlights a distributed-systems problem which vectors make far worse.
> >
> > Arrow is meant to be portable, as Ted discussed, but it is still columnar, and this is the source of endless problems in an execution engine. So, we want to ask: what is the optimal format for what Drill actually does?
> > I'm now of the opinion that Drill might actually benefit more from a row-based format, similar to what Impala uses. The notes even paint a path forward.
> >
> > Ted's description of the goal for Dremio suggests that Arrow might be the right answer for that market. Drill, however, tends to be used to query myriad data sources at scale and as a "query integrator" across systems. This use case has different needs, which may be better served with a row-based format.
> >
> > The upshot is that "value vectors vs. Arrow" is the wrong place to start the discussion. The right place is "what do our many years of experience with Drill suggest is the most efficient format for how Drill is actually used?"
> >
> > Note that Drill could have an Arrow-based API independent of the internal format. The quote from Charles explains how we could do that.
> >
> > Thanks,
> >
> > - Paul
> >
> > On Mon, Jan 3, 2022 at 12:54 PM Ted Dunning <ted.dunn...@gmail.com> wrote:
> >
> >> Christian,
> >>
> >> Your thoughts are very helpful. I find Arrow very nice (I use it in Agstack with Julia and Python).
> >>
> >> I don't think anybody is saying that Drill wouldn't be well served by a switch to Arrow or even just interfaces to Arrow. But it is a lot of work to make it all happen.
> >>
> >> On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix <z0lt...@pm.me.invalid> wrote:
> >>
> >>> Hi Charles, Ted, and the others here,
> >>>
> >>> It is very interesting to hear about the evolution of Drill, Dremio and Arrow in that context, and thank you Charles for restarting that discussion.
> >>>
> >>> I think, and James mentioned this in the PR as well, that Drill could benefit from the continuous progress the Arrow project has made since its separation from Drill. The Arrow community seems to be large, so I assume this will go on and on with improvements, new features, etc.
> >>> But I do not have enough experience with Drill internals to have an idea of how much refactoring this would lead to.
> >>>
> >>> In addition to that, I'm not aware of the current roadmap of Arrow and whether it would fit into Drill's roadmap. Maybe Arrow would go in a different direction than Drill, and what should we do if Drill is bound to Arrow then?
> >>>
> >>> On the other hand, Arrow could help Drill reach wider adoption with clients like pyarrow, arrow-flight, various other programming languages, etc. And (I'm not sure about that) maybe it's a performance benefit if Drill uses Arrow to read data from HDFS (for example), uses Arrow to work with it during execution, and gives the vectors directly to my Python (for example) program via arrow-flight so that I can play around with Pandas, etc.
> >>>
> >>> Just some thoughts I have, since I have used Dremio with pyarrow and Drill with ODBC connections.
> >>>
> >>> Regards
> >>> Christian
> >>>
> >>> -------- Original Message --------
> >>> On Jan 3, 2022, 20:08, Charles Givre wrote:
> >>>
> >>> Thanks Ted for the perspective! I had always wished to be a "fly on the wall" in those conversations. :-)
> >>> -- C
> >>>
> >>>> On Jan 3, 2022, at 11:00 AM, Charles Givre <cgi...@gmail.com> wrote:
> >>>>
> >>>> Hello all,
> >>>> There was a discussion in a recently closed PR [1] between z0ltrix, James Turton and a few others about integrating Drill with Apache Arrow and wondering why it was never done. I'd like to share my perspective as someone who has been around Drill for some time but also as someone who never worked for MapR or Dremio. This just represents my understanding of events as an outsider, and I could be wrong about some or all of this. Please forgive (or correct) any inaccuracies.
> >>>>
> >>>> When I first learned of Arrow and the idea of integrating Arrow with Drill, the thing that interested me the most was the ability to move data between platforms without having to serialize/deserialize the data. From my understanding, MapR did some research and didn't find a significant performance advantage and hence didn't really pursue the integration. The other side of it was that it would require a significant amount of work to refactor major parts of Drill.
> >>>>
> >>>> I don't know the internal politics, but this was one of the major points of divergence between Dremio and Drill.
> >>>>
> >>>> With that said, there was a renewed discussion on the list [2] where Paul Rogers proposed what he described as a "Crude but Effective" approach to an Arrow integration.
> >>>>
> >>>> This is in the email link, but here is part of Paul's email:
> >>>>
> >>>>> Charles, just brainstorming a bit, I think the easiest way to start is to create a simple, stand-alone server that speaks Arrow to the client, and uses the native Drill client to speak to Drill. The native Drill client exposes Drill value vectors. One trick would be to convert Drill vectors to the Arrow format. I think that data vectors are the same format. Possibly offset vectors. I think Arrow went its own way with null-value (Drill's is-set) vectors. So, some conversion might be a no-op, others might need to rewrite a vector. Good thing, this is purely at the vector level, so would be easy to write. The next issue is the one that Parth has long pointed out: Drill and Arrow each have their own memory allocators. How could we share a data vector between the two? The simplest initial solution is just to copy the data from Drill to Arrow. Slow, but transparent to the client.
> >>>>>
> >>>>> A crude first-approximation of the development steps:
> >>>>> 1. Create the client shell server.
> >>>>> 2. Implement the Arrow client protocol. Need some way to accept a query and return batches of results.
> >>>>> 3. Forward the query to Drill using the native Drill client.
> >>>>> 4. As a first pass, copy vectors from Drill to Arrow and return them to the client.
> >>>>> 5. Then, solve that memory allocator problem to pass data without copying.
> >>>>
> >>>> One point that Paul made was that these pieces are fairly discrete and could be implemented without refactoring major components of Drill. Of course, this could be something for Drill 2.0. At a minimum, could we take the conversation off of the PR and put it in the email list? ;-)
> >>>>
> >>>> Let's discuss... All ideas are welcome!
> >>>>
> >>>> Best,
> >>>> -- C
> >>>>
> >>>> [1]: https://github.com/apache/drill/pull/2412
> >>>> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
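
Editorial sketch, not part of the original thread: Paul's quoted email notes that the null-value (Drill "is-set") vectors are where a Drill-to-Arrow conversion would probably have to rewrite a vector rather than reuse it. The Python below illustrates what such a rewrite might look like. It assumes, and this is an assumption rather than something the thread states, that the Drill side stores one byte per value (non-zero meaning "set") and that the Arrow side expects a bit-packed validity bitmap with least-significant-bit-first ordering. The function names are hypothetical and belong to neither project's API.

```python
def is_set_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value is-set flags into a bit-packed validity bitmap.

    Assumed layouts (not confirmed by the thread):
      input:  one byte per value, non-zero means the value is present
      output: one bit per value, LSB-first within each byte, 1 means present
    """
    bitmap = bytearray((len(is_set) + 7) // 8)  # round up to whole bytes
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def validity_bitmap_to_is_set(bitmap: bytes, length: int) -> bytes:
    """Inverse direction: expand a bit-packed bitmap to one byte per value."""
    return bytes((bitmap[i // 8] >> (i % 8)) & 1 for i in range(length))


if __name__ == "__main__":
    flags = bytes([1, 0, 1, 1, 0, 0, 0, 0, 1])  # nine values, four present
    packed = is_set_to_validity_bitmap(flags)
    print(packed.hex())
    print(validity_bitmap_to_is_set(packed, len(flags)) == flags)
```

This is the "copy" flavor of step 4 in the list above: every batch pays for a pass over the null flags, which is the cost Paul describes as slow but transparent to the client.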