Hi Igor,

Before diving into design issues, it may be worthwhile to question the premise: should Drill adopt Arrow as its internal memory layout? This is the question the team has wrestled with since Arrow was launched. Arrow has three parts; let's consider each in turn.

First is the memory layout itself. The approach you suggest would let us work with the Arrow memory format: use EVF to access vectors, and the underlying vectors can be swapped from Drill to Arrow. But what is the advantage of using Arrow? The Arrow layout isn't better than Drill's; it is just different. Adopting the Arrow memory layout by itself provides little benefit but carries a big cost. This is one reason the team has been so reluctant to adopt Arrow. The only advantage of using the Arrow memory layout is if Drill could benefit from code written for Arrow.
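To make "just different" concrete, consider null handling. Drill's nullable vectors track nulls with a byte-per-value "bits" vector, while Arrow packs validity into one bit per value. Neither is inherently better, but every piece of code that touches nulls must change if the layout changes. A minimal sketch (plain Java arrays stand in for the real direct-memory buffers; the class names are made up for illustration):

// Drill-style null tracking: one byte per value.
class ByteFlagNulls {
  final byte[] isSet;                // 1 byte per row: 0 = null, 1 = set
  ByteFlagNulls(int rowCount) { isSet = new byte[rowCount]; }
  boolean isNull(int row)     { return isSet[row] == 0; }
  void markSet(int row)       { isSet[row] = 1; }
}

// Arrow-style null tracking: a packed validity bitmap, one bit per value.
class BitmapNulls {
  final byte[] validity;             // 1 bit per row, rounded up to bytes
  BitmapNulls(int rowCount) { validity = new byte[(rowCount + 7) / 8]; }
  boolean isNull(int row) {
    return (validity[row >> 3] & (1 << (row & 7))) == 0;
  }
  void markSet(int row) {
    validity[row >> 3] |= (1 << (row & 7));
  }
}

Multiply this by offsets, alignment, and metadata differences, and "just swap the vectors" becomes a substantial rewrite unless an access layer hides the physical details.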
The second part of Arrow is a set of modules to manipulate vectors; Gandiva is the most prominent example. However, there are major challenges. Most SQL operations are defined to work on rows; some clever thinking will be needed to convert those operations into a series of column operations. (Drill's codegen is NOT columnar: it works row by row.) So, if we want to benefit from Gandiva, we must completely rethink how we process batches. Is it worth doing all that work? The primary benefit would be performance, but it is not clear that our current implementation, row-based code generated in Java, is the bottleneck. It would be great for someone to do some benchmarks to see whether the potential gain from Gandiva justifies the likely large development cost.

The third part of Arrow is the exchange of vectors between Drill and Arrow-based clients or readers. As it turns out, this is not the big win it seems. As we've discussed, we could easily create an Arrow-based client for Drill: there is an RPC boundary between the client and Drill, and we can use it to do the format conversion. For readers, Drill will want control over batch sizes; Drill cannot blindly accept whatever size vectors a reader chooses to produce. (More on that later.) Incoming data will be subject to projection and selection, so it will quickly move out of the incoming Arrow vectors into vectors which Drill creates.

Arrow gets (or got) a lot of press. However, our job is to focus on what's best for Drill. There might actually be a memory layout for Drill that is better than Arrow (and better than our current vectors). A couple of us did a prototype some time ago that seemed to show promise. So it is not clear that adopting Arrow is necessarily a huge win: maybe it is, maybe not. We need to figure it out.

What IS clearly a huge win is the idea you outlined: creating a layer between the memory layout and the rest of Drill, so that we can try out different memory layouts to see what works best.
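To sketch what that layer buys us (hypothetical interfaces, far simpler than the real EVF accessors): operators are compiled against writer interfaces, and only the implementation knows the physical layout, so trying a new layout means writing one new implementation rather than touching every operator.

// What operators see: no hint of the physical layout.
interface IntColumnWriter {
  void setInt(int rowIndex, int value);
}

// One implementation per candidate layout: current Drill vectors, Arrow
// vectors, or the prototype format mentioned above. A plain array stands
// in here for the real, direct-memory buffer.
class DrillLayoutIntWriter implements IntColumnWriter {
  private final int[] buffer;
  DrillLayoutIntWriter(int capacity) { buffer = new int[capacity]; }
  @Override public void setInt(int rowIndex, int value) {
    buffer[rowIndex] = value;
  }
}

// A filter, project, or sort operator written against the interface is
// untouched when we swap DrillLayoutIntWriter for an Arrow-backed one.
class ProjectOperator {
  void emit(IntColumnWriter out, int rowIndex, int value) {
    out.setInt(rowIndex, value);
  }
}

That is the property that would let us benchmark layouts against one another and keep whichever wins.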
Thanks,
- Paul

On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <ihor.huzenko....@gmail.com> wrote:

Hello Paul,

I totally agree that integrating Arrow by simply replacing Vectors usage everywhere will cause a disaster. After a first look at the new Enhanced Vector Framework (EVF), and based on your suggestions, I think I have an idea to share. In my opinion, the integration can be done in two major stages:

*1. Preparation Stage*
1.1 Extract all EVF and related components into a separate module, so that the new module depends only on the Vectors module.
1.2 Rewrite all operators, step by step, to use the higher-level EVF module, and remove the Vectors module from the dependencies of exec and other modules. (See the P.S. at the end of this message for a rough before/after sketch.)
1.3 Ensure that the only module which depends on Vectors is the new EVF module.

*2. Integration Stage*
2.1 Add a dependency on the Arrow Vectors module to the EVF module.
2.2 Replace all usages of Drill Vectors & Protobuf metadata with Arrow Vectors & Flatbuffers metadata in the EVF module.
2.3 Finalize the integration by removing the Drill Vectors module completely.

*NOTE:* I think that, either way, we won't preserve backward compatibility for drivers and custom UDFs. The proposed changes are a major step forward, to be included in the Drill 2.0 version.

Below is a first list of packages that may in the future be moved into the EVF module:

*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (An enhanced set of classes to describe a Drill schema.)
org.apache.drill.exec.record.metadata.schema.parser
org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for each kind of Drill vector.)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (Record batches management)
org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory mgmt)
org.apache.drill.exec.physical.impl.scan - (Row set based scan)

Thanks,
Igor Guzenko
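P.S. To make step 1.2 concrete, here is roughly the before/after shape of operator code. The API details are approximate and from memory (package paths and signatures not checked against master), so please treat this as a sketch of the shape rather than exact code:

import org.apache.drill.common.types.TypeProtos.MinorType;
import org.apache.drill.common.types.Types;
import org.apache.drill.exec.physical.resultSet.ResultSetLoader;
import org.apache.drill.exec.physical.resultSet.RowSetLoader;
import org.apache.drill.exec.record.MaterializedField;
import org.apache.drill.exec.record.VectorContainer;
import org.apache.drill.exec.vector.IntVector;

class Step12Sketch {

  // BEFORE: the operator names a concrete Drill vector class, so every
  // operator depends on the Vectors module.
  void writeDirect(VectorContainer container, int rowIndex, int value) {
    MaterializedField field =
        MaterializedField.create("id", Types.required(MinorType.INT));
    IntVector vector = (IntVector) container.addOrGet(field);
    vector.allocateNew();
    vector.getMutator().setSafe(rowIndex, value);
  }

  // AFTER: the operator talks only to EVF writers; no vector class
  // appears, so only the EVF module depends on Vectors (step 1.3), and
  // swapping Drill vectors for Arrow vectors (stage 2) is invisible here.
  void writeViaEvf(ResultSetLoader loader, int value) {
    RowSetLoader rowWriter = loader.writer();
    loader.startBatch();
    rowWriter.start();
    rowWriter.scalar("id").setInt(value);
    rowWriter.save();
  }
}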