I wonder if there isn't a better place for this discussion? As you point out, there are many threads and many of the points are rather contentious technically. That will make them even harder to follow in an email thread.
We could just use the wiki and format the text as questions with alternative positions. Or we could use an open Google document in a similar form. What's the preference here?

On Mon, Jan 3, 2022 at 7:34 PM Paul Rogers <[email protected]> wrote:

> Hi Charles,
>
> The material is rather dense and benefits from the GitHub formatting. To preserve it, perhaps we can copy it to a subpage of the Drill 2.0 wiki page.
>
> For now, the link to the discussion is [1]. Since the wiki is not good for discussions, let's have that discussion here (if anyone is up to tackling such a weighty subject).
>
> Thanks,
>
> - Paul
>
> [1] https://github.com/apache/drill/pull/2412
>
> On Mon, Jan 3, 2022 at 5:15 PM Charles Givre <[email protected]> wrote:
>
> > @Paul,
> > Do you mind if I copy the contents of your response to DRILL-8088 to this thread? There's a lot of good info there, and I'd hate to see it get lost.
> > -- C
> >
> > > On Jan 3, 2022, at 7:41 PM, Paul Rogers <[email protected]> wrote:
> > >
> > > Hi All,
> > >
> > > Thanks Charles for dredging up that old discussion, your memory is better than mine! And, thanks Ted for that summary of MapR history. As one of the "replacement crew" brought in after the original folks left, your description is consistent with my memory of events. Moreover, as we looked at what was needed to run Drill in production, an Arrow port was far down on the list: it would not have solved actual customer problems.
> > >
> > > Before we get too excited about Arrow, I think we should have a discussion about what we want in an internal storage format. I added a long (sorry) set of comments in that PR that Charles mentioned that tries to debunk the myths that have grown up around using a columnar format as the internal representation for a query engine. (Columnar is great for storage.)
> > > The note presents the many issues we've encountered over the years that have caused us to layer ever more code on top of vectors to solve various problems. It also highlights a distributed-systems problem which vectors make far worse.
> > >
> > > Arrow is meant to be portable, as Ted discussed, but it is still columnar, and this is the source of endless problems in an execution engine. So, we want to ask, what is the optimal format for what Drill actually does? I'm now of the opinion that Drill might actually benefit more from a row-based format, similar to what Impala uses. The notes even paint a path forward.
> > >
> > > Ted's description of the goal for Dremio suggests that Arrow might be the right answer for that market. Drill, however, tends to be used to query myriad data sources at scale and as a "query integrator" across systems. This use case has different needs, which may be better served with a row-based format.
> > >
> > > The upshot is that "value vectors vs. Arrow" is the wrong place to start the discussion. The right place is "what does our many years of experience with Drill suggest is the most efficient format for how Drill is actually used?"
> > >
> > > Note that Drill could have an Arrow-based API independent of the internal format. The quote from Charles explains how we could do that.
> > >
> > > Thanks,
> > >
> > > - Paul
> > >
> > > On Mon, Jan 3, 2022 at 12:54 PM Ted Dunning <[email protected]> wrote:
> > >
> > >> Christian,
> > >>
> > >> Your thoughts are very helpful. I find Arrow very nice (I use it in Agstack with Julia and Python).
> > >>
> > >> I don't think anybody is saying that Drill wouldn't be well set with a switch to Arrow or even just interfaces to Arrow. But it is a lot of work to make it all happen.
> > >>
> > >> On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix <[email protected]> wrote:
> > >>
> > >>> Hi Charles, Ted, and the others here,
> > >>>
> > >>> It is very interesting to hear about the evolution of Drill, Dremio and Arrow in that context, and thank you Charles for restarting that discussion.
> > >>>
> > >>> I think, and James mentioned this in the PR as well, that Drill could benefit from the continuous progress the Arrow project has made since its separation from Drill. The Arrow community seems to be large, so I assume this will go on and on with improvements, new features, etc., but I do not have enough experience with Drill internals to have an idea of how much refactoring this would require.
> > >>>
> > >>> In addition to that, I'm not aware of the current roadmap of Arrow and whether it would fit into Drill's roadmap. Maybe Arrow would go in a different direction than Drill, and what should we do if Drill is bound to Arrow then?
> > >>>
> > >>> On the other hand, Arrow could help Drill reach wider adoption with clients like pyarrow, arrow-flight, various other programming languages, etc., and (I'm not sure about that) maybe it's a performance benefit if Drill uses Arrow to read data from HDFS (for example), uses Arrow to work with it during execution, and gives the vectors directly to my Python (for example) program via arrow-flight so that I can play around with Pandas, etc.
> > >>>
> > >>> Just some thoughts I have, since I have used Dremio with pyarrow and Drill with ODBC connections.
> > >>>
> > >>> Regards
> > >>> Christian
> > >>> -------- Original Message --------
> > >>> On Jan 3, 2022, 20:08, Charles Givre wrote:
> > >>>
> > >>> Thanks Ted for the perspective! I had always wished to be a "fly on the wall" in those conversations.
> > >>> :-)
> > >>> -- C
> > >>>
> > >>>> On Jan 3, 2022, at 11:00 AM, Charles Givre <[email protected]> wrote:
> > >>>>
> > >>>> Hello all,
> > >>>> A recently closed PR [1] included a discussion between z0ltrix, James Turton and a few others about integrating Drill with Apache Arrow and why it was never done. I'd like to share my perspective as someone who has been around Drill for some time but also as someone who never worked for MapR or Dremio. This just represents my understanding of events as an outsider, and I could be wrong about some or all of this. Please forgive (or correct) any inaccuracies.
> > >>>>
> > >>>> When I first learned of Arrow and the idea of integrating Arrow with Drill, the thing that interested me the most was the ability to move data between platforms without having to serialize/deserialize the data. From my understanding, MapR did some research and didn't find a significant performance advantage and hence didn't really pursue the integration. The other side of it was that it would require a significant amount of work to refactor major parts of Drill.
> > >>>>
> > >>>> I don't know the internal politics, but this was one of the major points of divergence between Dremio and Drill.
> > >>>>
> > >>>> With that said, there was a renewed discussion on the list [2] where Paul Rogers proposed what he described as a "Crude but Effective" approach to an Arrow integration.
> > >>>>
> > >>>> This is in the email link but here was a part of Paul's email:
> > >>>>
> > >>>>> Charles, just brainstorming a bit, I think the easiest way to start is to create a simple, stand-alone server that speaks Arrow to the client, and uses the native Drill client to speak to Drill.
> > >>>>> The native Drill client exposes Drill value vectors. One trick would be to convert Drill vectors to the Arrow format. I think that data vectors are the same format. Possibly offset vectors. I think Arrow went its own way with null-value (Drill's is-set) vectors. So, some conversion might be a no-op, others might need to rewrite a vector. Good thing, this is purely at the vector level, so would be easy to write. The next issue is the one that Parth has long pointed out: Drill and Arrow each have their own memory allocators. How could we share a data vector between the two? The simplest initial solution is just to copy the data from Drill to Arrow. Slow, but transparent to the client.
> > >>>>>
> > >>>>> A crude first-approximation of the development steps:
> > >>>>> 1. Create the client shell server.
> > >>>>> 2. Implement the Arrow client protocol. Need some way to accept a query and return batches of results.
> > >>>>> 3. Forward the query to Drill using the native Drill client.
> > >>>>> 4. As a first pass, copy vectors from Drill to Arrow and return them to the client.
> > >>>>> 5. Then, solve that memory allocator problem to pass data without copying.
> > >>>>
> > >>>> One point that Paul made was that these pieces are fairly discrete and could be implemented without refactoring major components of Drill. Of course, this could be something for Drill 2.0. At a minimum, could we take the conversation off of the PR and put it in the email list? ;-)
> > >>>>
> > >>>> Let's discuss... All ideas are welcome!
> > >>>>
> > >>>> Best,
> > >>>> -- C
> > >>>>
> > >>>> [1]: https://github.com/apache/drill/pull/2412
> > >>>> [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
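
To make the vector-conversion point in Paul's quoted message a bit more concrete, here is a minimal sketch of the one conversion he expects to need a rewrite: the null-value vector. It assumes Drill's is-set vector stores one byte per value (non-zero meaning "set"), and it targets the Arrow validity bitmap layout of one bit per value, least-significant bit first, with 1 meaning valid. The function names are hypothetical and are not part of either project's API; real Arrow buffers are also padded and owned by an allocator, which this sketch ignores.

```python
def is_set_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack a byte-per-value is-set vector into a bit-per-value bitmap.

    Assumption: each input byte is non-zero when the value is present.
    Output bit i lives in byte i // 8 at bit position i % 8 (LSB first).
    """
    bitmap = bytearray((len(is_set) + 7) // 8)  # round up to whole bytes
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap: bytes, i: int) -> bool:
    """Read bit i of a validity bitmap produced above."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


def null_count(is_set: bytes) -> int:
    """Count null entries; Arrow carries this count alongside the bitmap."""
    return sum(1 for flag in is_set if not flag)


if __name__ == "__main__":
    drill_is_set = bytes([1, 0, 1, 1, 0, 0, 0, 1, 1])  # nine values, four nulls
    bitmap = is_set_to_validity_bitmap(drill_is_set)
    print(bitmap.hex(), null_count(drill_is_set))
```

This is the "copy" flavor of step 4 in the list above: every value is touched once, so it is slow but independent of how either side allocates memory. Fixed-width data vectors would, if the formats really do match as Paul suspects, need only a plain byte copy.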
