Christian,

Your thoughts are very helpful. I find Arrow very nice (I use it in Agstack with Julia and Python).
I don't think anybody is saying that Drill wouldn't be well served by a switch to Arrow, or even just by interfaces to Arrow. But it is a lot of work to make it all happen.

On Mon, Jan 3, 2022 at 11:37 AM Z0ltrix <[email protected]> wrote:

> Hi Charles, Ted, and the others here,
>
> It is very interesting to hear about the evolution of Drill, Dremio and
> Arrow in that context, and thank you Charles for restarting that
> discussion.
>
> I think, and James mentioned this in the PR as well, that Drill could
> benefit from the continuous progress the Arrow project has made since
> its separation from Drill. The Arrow community seems to be large, so I
> assume this will go on with improvements, new features, etc. But I do
> not have enough experience with Drill internals to have an idea of how
> much refactoring this would require.
>
> In addition, I'm not aware of Arrow's current roadmap or whether it
> would fit with Drill's roadmap. Arrow might go in a different direction
> than Drill; what should we do then, if Drill is bound to Arrow?
>
> On the other hand, Arrow could help Drill reach wider adoption through
> clients like pyarrow, arrow-flight, various other programming
> languages, etc. And (I'm not sure about this) maybe there is a
> performance benefit if Drill uses Arrow to read data from HDFS (for
> example), uses Arrow to work with it during execution, and hands the
> vectors directly to my Python (for example) program via arrow-flight so
> that I can play around with Pandas, etc.
>
> Just some thoughts I have had since I have used Dremio with pyarrow and
> Drill with ODBC connections.
>
> Regards
> Christian
>
> -------- Original Message --------
> On Jan 3, 2022, 20:08, Charles Givre wrote:
>
> > Thanks Ted for the perspective! I had always wished to be a "fly on
> > the wall" in those conversations.
> > :-)
> > -- C
> >
> > On Jan 3, 2022, at 11:00 AM, Charles Givre <[email protected]> wrote:
> >
> > Hello all,
> > There was a discussion in a recently closed PR [1] between z0ltrix,
> > James Turton and a few others about integrating Drill with Apache
> > Arrow and why it was never done. I'd like to share my perspective as
> > someone who has been around Drill for some time but also as someone
> > who never worked for MapR or Dremio. This just represents my
> > understanding of events as an outsider, and I could be wrong about
> > some or all of this. Please forgive (or correct) any inaccuracies.
> >
> > When I first learned of Arrow and the idea of integrating Arrow with
> > Drill, the thing that interested me the most was the ability to move
> > data between platforms without having to serialize/deserialize the
> > data. From my understanding, MapR did some research and didn't find a
> > significant performance advantage and hence didn't really pursue the
> > integration. The other side of it was that it would require a
> > significant amount of work to refactor major parts of Drill.
> >
> > I don't know the internal politics, but this was one of the major
> > points of divergence between Dremio and Drill.
> >
> > With that said, there was a renewed discussion on the list [2] where
> > Paul Rogers proposed what he described as a "Crude but Effective"
> > approach to an Arrow integration.
> >
> > This is in the linked email, but here is part of what Paul wrote:
> >
> >> Charles, just brainstorming a bit, I think the easiest way to start
> >> is to create a simple, stand-alone server that speaks Arrow to the
> >> client, and uses the native Drill client to speak to Drill. The
> >> native Drill client exposes Drill value vectors. One trick would be
> >> to convert Drill vectors to the Arrow format. I think that data
> >> vectors are the same format. Possibly offset vectors. I think Arrow
> >> went its own way with null-value (Drill's is-set) vectors.
> >> So, some conversion might be a no-op, others might need to rewrite
> >> a vector. Good thing, this is purely at the vector level, so would
> >> be easy to write. The next issue is the one that Parth has long
> >> pointed out: Drill and Arrow each have their own memory allocators.
> >> How could we share a data vector between the two? The simplest
> >> initial solution is just to copy the data from Drill to Arrow. Slow,
> >> but transparent to the client.
> >>
> >> A crude first-approximation of the development steps:
> >> 1. Create the client shell server.
> >> 2. Implement the Arrow client protocol. Need some way to accept a
> >>    query and return batches of results.
> >> 3. Forward the query to Drill using the native Drill client.
> >> 4. As a first pass, copy vectors from Drill to Arrow and return them
> >>    to the client.
> >> 5. Then, solve that memory allocator problem to pass data without
> >>    copying.
> >
> > One point that Paul made was that these pieces are fairly discrete
> > and could be implemented without refactoring major components of
> > Drill. Of course, this could be something for Drill 2.0. At a
> > minimum, could we take the conversation off of the PR and put it on
> > the email list? ;-)
> >
> > Let's discuss... All ideas are welcome!
> >
> > Best,
> > -- C
> >
> > [1]: https://github.com/apache/drill/pull/2412
> > [2]: https://lists.apache.org/thread/hcmygrv8q8jyw8p57fm9qy3vw2kqfr5l
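
To make Paul's remark about null-value vectors a bit more concrete, here is a rough sketch of the kind of rewrite he seems to have in mind. It assumes, as far as I understand the two formats, that Drill's is-set vector stores one byte per value (non-zero meaning "set") and that Arrow's validity bitmap stores one bit per value in least-significant-bit order (1 meaning "valid"). The function names are made up for illustration and are not part of either project's API; a real converter would work on the projects' own buffer types rather than Python bytes.

```python
def is_set_to_validity_bitmap(is_set: bytes) -> bytes:
    """Pack one-byte-per-value is-set flags into a bit-packed validity bitmap.

    Assumption: any non-zero byte means the value is present, and bit i of
    the result lives in byte i // 8 at bit position i % 8 (LSB first).
    """
    bitmap = bytearray((len(is_set) + 7) // 8)  # round up to whole bytes
    for i, flag in enumerate(is_set):
        if flag:
            bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)


def is_valid(bitmap: bytes, i: int) -> bool:
    """Read back the validity bit for value i (hypothetical helper)."""
    return bool(bitmap[i // 8] & (1 << (i % 8)))


# Ten values, of which indices 0, 2, 3, 7 and 8 are set.
flags = bytes([1, 0, 1, 1, 0, 0, 0, 1, 1, 0])
bitmap = is_set_to_validity_bitmap(flags)
print(bitmap.hex())  # 8d01
```

This is a copy, so it matches the "first pass" in step 4 rather than the zero-copy goal of step 5. Under the same assumptions, data vectors with an identical layout would need no such rewrite.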
