Hi Igor,

You mentioned EVF. For those who are newer to the project, let me recap the history of EVF and how it fits into the Arrow picture.
The original idea of value vectors was to create long blocks of data that we can load into the CPU cache, then apply operations to without cache misses: load a vector of ints, then use CPU vector instructions to sum those ints. Somehow, this led to the belief that bigger vectors are better. Some years ago, we found we were creating vectors of 16 MB, 32 MB, or larger. (In fact, there is no need for vectors larger than the CPU cache, but that is another discussion.)

Drill uses Netty for memory allocation. It turns out, however, that we run into memory fragmentation issues when vectors grow beyond 16 MB. So, we need to limit vector (and thus batch) size. Record readers (part of the Scan operator), however, are in an awkward position: they have no idea how large the data they read will be. Most just read, say, 4K rows. 4K ints is 16 KB of memory, which is fine. Maybe make the vectors bigger? Maybe 64K rows of 4 bytes each is better: the vector is 256 KB, which is the size of the CPU cache. Perfect! But what if we are loading large text values? Maybe each row is 1 MB in size. In that case, we can fit only 16 (not 16K, just 16) VARCHARs into a 16 MB vector. But how the heck does the reader know that ahead of time?

The ResultSetLoader (part of EVF) is designed to handle this: it writes data until the vector fills, then does an "overflow shuffle" to hide all the messy details from the reader. The reader just asks "please sir, may I write another?" and the EVF answers either "yes" or "no." To accomplish this, we ended up needing to wrap most vector functionality currently spread across generated code, the vectors themselves, accessors, mutators, complex writers, complex readers, and more. All these operations need to occur in the column readers and writers because they all revolve around the current row index (or array index when dealing with nested data).

BTW: EVF consists of two parts. The column writers (and the awkwardly named "ResultSetLoader") do the vector writing; matching column readers handle access to existing vectors. The rest of EVF does scan-specific work such as projection, implicit columns, missing columns, data translation, iterating over readers, and more. Only the column readers and writers are relevant to operators other than Scan.
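In code terms, the writer half looks roughly like this from a reader's perspective (a minimal sketch against the ResultSetLoader/RowSetLoader classes in the packages Igor lists below; hasMoreRows(), nextAmount(), and sendDownstream() are hypothetical stand-ins for reader- and operator-specific code, and exact signatures may differ across Drill versions):

    import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
    import org.apache.drill.exec.physical.resultSet.RowSetLoader;

    void readBatch(ResultSetLoader rsLoader) {
      RowSetLoader writer = rsLoader.writer();
      rsLoader.startBatch();
      // "Please sir, may I write another?" -- isFull() is the yes/no.
      while (!writer.isFull() && hasMoreRows()) {
        writer.start();                               // begin a row
        writer.scalar("amount").setInt(nextAmount()); // no vector details leak out
        writer.save();                                // commit the row
      }
      // If a value overflowed the vector mid-row, the loader has already
      // shuffled the partial row into a look-ahead batch behind the scenes.
      sendDownstream(rsLoader.harvest());
    }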
So, the result of solving the memory fragmentation issue is that we now have a near-complete API on top of vectors, independent of the memory layout and the value vector implementation. (There is a bit more work, in flight, to allow bulk copies.)

This is why our focus with EVF thus far has been the readers: that is where the memory fragmentation issue is worst. CSV is done. JSON is in flight. Very glad to see Arina's work converting Avro, and Charles' work on his format and storage plugins. And, of course, your and others' help in reviewing the supporting PRs.

Others on the team created good interim solutions in the other operators via the "batch sizing" work. Still, the hope was that, eventually, the other operators would use the column writers for a unified, common solution. So, this leads to your suggestions about the other operators. More on that later.

Thanks,
- Paul

On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <ihor.huzenko....@gmail.com> wrote:

Hello Paul,

I totally agree that integrating Arrow by simply replacing vector usage everywhere would cause a disaster. After a first look at the new Enhanced Vector Framework (EVF), and based on your suggestions, I think I have an idea to share.

In my opinion, the integration can be done in two major stages:

*1. Preparation Stage*
1.1 Extract EVF and all related components into a separate module, so that the new module depends only on the Vectors module.
1.2 Rewrite all operators, step by step, to use the higher-level EVF module, and remove the Vectors module from the dependencies of the exec and other modules.
1.3 Ensure that the only module which depends on Vectors is the new EVF module.

*2. Integration Stage*
2.1 Add a dependency on the Arrow Vectors module to the EVF module.
2.2 Replace all usages of Drill Vectors & Protobuf metadata with Arrow Vectors & Flatbuffers metadata in the EVF module (see the sketch after the package list below).
2.3 Finalize the integration by removing the Drill Vectors module completely.

*NOTE:* I think that either way we won't preserve backward compatibility for drivers and custom UDFs, so the proposed changes are a major step forward to be included in the Drill 2.0 version.

Below is a very first list of packages that may in future be moved into the EVF module:

*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (an enhanced set of classes to describe a Drill schema)
org.apache.drill.exec.record.metadata.schema.parser
org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for each kind of Drill vector)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (record batch management)
org.apache.drill.exec.physical.resultSet - (enhanced rowSet with memory management)
org.apache.drill.exec.physical.impl.scan - (row set based scan)
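To make step 2.2 concrete, the lowest-level change amounts to something like the following (a rough sketch assuming the current Drill and Arrow Java vector APIs; the field name, counts, and drillAllocator are illustrative, and after stage 1 only the EVF module would contain code like either half):

    import org.apache.drill.common.types.TypeProtos.MinorType;
    import org.apache.drill.common.types.Types;
    import org.apache.drill.exec.record.MaterializedField;

    // Today: Drill vector writes via mutators, metadata via Protobuf types.
    MaterializedField field =
        MaterializedField.create("amount", Types.required(MinorType.INT));
    org.apache.drill.exec.vector.IntVector drillVec =
        new org.apache.drill.exec.vector.IntVector(field, drillAllocator);
    drillVec.allocateNew(4096);
    drillVec.getMutator().setSafe(0, 42);
    drillVec.getMutator().setValueCount(1);

    // After 2.2: equivalent Arrow calls, metadata via Flatbuffers-based types.
    org.apache.arrow.memory.RootAllocator arrowAllocator =
        new org.apache.arrow.memory.RootAllocator();
    org.apache.arrow.vector.IntVector arrowVec =
        new org.apache.arrow.vector.IntVector("amount", arrowAllocator);
    arrowVec.allocateNew(4096);
    arrowVec.setSafe(0, 42);
    arrowVec.setValueCount(1);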
Thanks,
Igor Guzenko

On Mon, Dec 9, 2019 at 8:53 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi All,
>
> Would be good to do some design brainstorming around this.
>
> Integration with other tools depends on the APIs (the first two items I
> mentioned). Last time I checked (more than a year ago), the memory layout of
> Arrow is close to that in Drill; so conversion is mostly a matter of
> "packaging" and metadata, which can be encapsulated in an API.
>
> Converting the internals is a major undertaking. We have large amounts of
> complex, critical code that works directly with the details of value
> vectors. My thought was to first convert that code to use the column
> readers/writers we've developed. Then, once all internal code uses that
> abstraction, we can replace the underlying vector implementation with
> Arrow. This lets us work in small stages, each of which is deliverable by
> itself.
>
> The other approach is to change all code that works directly with Drill
> vectors to instead work with Arrow. Because that code is so detailed and
> fragile, that is a huge, risky project.
>
> There are other approaches as well. Would be good to explore them before
> we dive into a major project.
>
> Thanks,
> - Paul
>
> On Monday, December 9, 2019, 07:07:31 AM PST, Charles Givre <
> cgi...@gmail.com> wrote:
>
> Hi Igor,
> That would be really great if you could see that through to completion.
> IMHO, the value from this is not so much performance-related but rather the
> ability to use Drill to gather and prep data and seamlessly "hand it off"
> to other platforms for machine learning.
> -- C
>
> > On Dec 9, 2019, at 5:48 AM, Igor Guzenko <ihor.huzenko....@gmail.com>
> > wrote:
> >
> > Hello Nai and Paul,
> >
> > I would like to contribute full Apache Arrow integration.
> >
> > Thanks,
> > Igor
> >
> > On Mon, Dec 9, 2019 at 8:56 AM Paul Rogers <par0...@yahoo.com.invalid>
> > wrote:
> >
> >> Hi Nai Yan,
> >>
> >> Integration is still in the discussion stages. Work has been progressing
> >> on some foundations which would help that integration.
> >>
> >> At the Developer's Day we talked about several ways to integrate. These
> >> include:
> >>
> >> 1. A storage plugin to read Arrow buffers from some source, so that you
> >> could use Arrow data in a Drill query.
> >>
> >> 2. A new Drill client API that produces Arrow buffers from a Drill query,
> >> so that an Arrow-based tool can consume Arrow data from Drill.
> >>
> >> 3. Replacement of the Drill value vectors internally with Arrow buffers.
> >>
> >> The first two are relatively straightforward; they just need someone to
> >> contribute an implementation. The third is a major long-term project
> >> because of the way Drill value vectors and Arrow vectors have diverged.
> >>
> >> I wonder, which of these use cases is of interest to you? How might you
> >> use that integration in your project?
> >>
> >> Thanks,
> >> - Paul
> >>
> >> On Sunday, December 8, 2019, 10:33:23 PM PST, Nai Yan. <
> >> zhaon...@gmail.com> wrote:
> >>
> >> Greetings,
> >> As mentioned at Drill Developer Day 2018, there is a plan for Drill to
> >> integrate Arrow (Gandiva from Dremio). I was wondering how that is going.
> >>
> >> Thanks in advance.
> >>
> >> Nai Yan
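As a footnote to approach 2 in the quoted thread: the simplest possible bridge is a per-value copy into Arrow vectors. A minimal sketch, assuming the Arrow Java API and a required INT column (toArrow and the column name are illustrative; a production version would hand off the underlying buffers rather than copy, since the memory layouts are close, as noted above):

    import org.apache.arrow.memory.RootAllocator;
    import org.apache.arrow.vector.IntVector;
    import org.apache.arrow.vector.VectorSchemaRoot;

    // Copy one required-INT Drill column into an Arrow vector and wrap it
    // in a VectorSchemaRoot, the batch unit an Arrow-based client consumes.
    static VectorSchemaRoot toArrow(org.apache.drill.exec.vector.IntVector drillVec,
                                    RootAllocator allocator) {
      int n = drillVec.getAccessor().getValueCount();
      IntVector arrowVec = new IntVector("amount", allocator);
      arrowVec.allocateNew(n);
      for (int i = 0; i < n; i++) {
        arrowVec.set(i, drillVec.getAccessor().get(i));
      }
      arrowVec.setValueCount(n);
      return VectorSchemaRoot.of(arrowVec);
    }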