Hello Paul,

I totally agree that integrating Arrow by simply replacing Vectors usage everywhere would cause a disaster. After a first look at the new Enhanced Vector Framework (EVF), and based on your suggestions, I think I have an idea to share. In my opinion, the integration can be done in two major stages:
*1. Preparation Stage*

1.1 Extract all EVF and related components into a separate module, so that the new module depends only on the Vectors module.
1.2 Rewrite all operators step by step to use the higher-level EVF module, and remove the Vectors module from the dependencies of exec and other modules (a rough sketch of this writer-based style is shown below).
1.3 Ensure that the only module which depends on Vectors is the new EVF module.

*2. Integration Stage*

2.1 Add a dependency on the Arrow Vectors module to the EVF module.
2.2 Replace all usages of Drill Vectors & Protobuf metadata with Arrow Vectors & Flatbuffers metadata inside the EVF module.
2.3 Finalize the integration by removing the Drill Vectors module completely.

*NOTE:* I think that in any case we won't preserve backward compatibility for drivers and custom UDFs, and the proposed changes are a major step forward to be included in Drill 2.0.

Below is a very first list of packages that may in future be moved into the EVF module:

*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (An enhanced set of classes to describe a Drill schema.)
org.apache.drill.exec.record.metadata.schema.parser
org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for each kind of Drill vector.)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (Record batches management)
org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory mgmt)
org.apache.drill.exec.physical.impl.scan - (Row set based scan)
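To illustrate what 1.2 means for operator code, here is a rough sketch of a reader/operator populating a batch through the EVF column writers instead of manipulating value vectors directly. The class, method and column names below are made up for illustration; only the writer calls mirror the accessor API (org.apache.drill.exec.vector.accessor), and exact signatures should be checked against the actual EVF sources:

    import org.apache.drill.exec.physical.resultSet.RowSetLoader;

    public class EvfWriteSketch {

      // Hypothetical helper: the caller owns the ResultSetLoader lifecycle
      // (startBatch/harvest); this method only fills rows through the column
      // writers, so it never touches the underlying vectors and would not
      // need to change when the vector implementation is swapped to Arrow.
      public static void writeRows(RowSetLoader rowWriter, int rowCount) {
        for (int i = 0; i < rowCount && !rowWriter.isFull(); i++) {
          rowWriter.start();                               // begin a new row
          rowWriter.scalar("id").setInt(i);                // write an INT column by name
          rowWriter.scalar("name").setString("row-" + i);  // write a VARCHAR column by name
          rowWriter.save();                                // commit the row to the batch
        }
      }
    }

Once all operators are written against this kind of abstraction, swapping Drill Vectors for Arrow Vectors in stage 2 should stay localized inside the EVF module.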
Thanks,
Igor Guzenko

On Mon, Dec 9, 2019 at 8:53 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi All,
>
> Would be good to do some design brainstorming around this.
>
> Integration with other tools depends on the APIs (the first two items I mentioned.) Last time I checked (more than a year ago), memory layout of Arrow is close to that in Drill; so conversion is around "packaging" and metadata, which can be encapsulated in an API.
>
> Converting internals is a major undertaking. We have large amounts of complex, critical code that works directly with the details of value vectors. My thought was to first convert code to use the column readers/writers we've developed. Then, once all internal code uses that abstraction, we can replace the underlying vector implementation with Arrow. This lets us work in small stages, each of which is deliverable by itself.
>
> The other approach is to change all code that works directly with Drill vectors to instead work with Arrow. Because that code is so detailed and fragile, that is a huge, risky project.
>
> There are other approaches as well. Would be good to explore them before we dive into a major project.
>
> Thanks,
> - Paul
>
> On Monday, December 9, 2019, 07:07:31 AM PST, Charles Givre <cgi...@gmail.com> wrote:
>
> Hi Igor,
> That would be really great if you could see that through to completion. IMHO, the value from this is not so much performance related but rather the ability to use Drill to gather and prep data and seamlessly "hand it off" to other platforms for machine learning.
> -- C
>
> On Dec 9, 2019, at 5:48 AM, Igor Guzenko <ihor.huzenko....@gmail.com> wrote:
>
> > Hello Nai and Paul,
> >
> > I would like to contribute full Apache Arrow integration.
> >
> > Thanks,
> > Igor
> >
> > On Mon, Dec 9, 2019 at 8:56 AM Paul Rogers <par0...@yahoo.com.invalid> wrote:
> >
> >> Hi Nai Yan,
> >>
> >> Integration is still in the discussion stages. Work has been progressing on some foundations which would help that integration.
> >>
> >> At the Developer's Day we talked about several ways to integrate. These include:
> >>
> >> 1. A storage plugin to read Arrow buffers from some source so that you could use Arrow data in a Drill query.
> >>
> >> 2. A new Drill client API that produces Arrow buffers from a Drill query so that an Arrow-based tool can consume Arrow data from Drill.
> >>
> >> 3. Replacement of the Drill value vectors internally with Arrow buffers.
> >>
> >> The first two are relatively straightforward; they just need someone to contribute an implementation. The third is a major long-term project because of the way Drill value vectors and Arrow vectors have diverged.
> >>
> >> I wonder, which of these use cases is of interest to you? How might you use that integration in your project?
> >>
> >> Thanks,
> >> - Paul
> >>
> >> On Sunday, December 8, 2019, 10:33:23 PM PST, Nai Yan. <zhaon...@gmail.com> wrote:
> >>
> >> Greetings,
> >> As mentioned in Drill Developer Day 2018, there's a plan for Drill to integrate Arrow (Gandiva from Dremio). I was wondering how it is going.
> >>
> >> Thanks in advance.
> >>
> >> Nai Yan