Hi Igor,

Bingo! You clearly explained the idea that has been simmering since I started on the "Row Set" framework (which evolved into EVF). Some of the early ideas are in [1]. At one of the Drill Developer Days, there was a brief discussion of the approach you propose: create an API on top of our memory layout, modify operators to use that API, then change the underlying memory representation. You've done a nice job spelling out some of the steps involved.
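The "API on top of our memory layout" idea can be sketched roughly as follows. This is a toy illustration only: all names are invented, and Drill's actual EVF column writers are far richer. The point is simply that operator code written against the interface never touches buffers, so the backing store can change underneath it.

```java
// Hypothetical sketch (invented names; not Drill's actual EVF API):
// a writer interface that hides the memory layout from operator code.
public class LayoutHidingApiSketch {

  /** Consumers write values through this interface and never see buffers. */
  interface IntColumnWriter {
    void setInt(int rowIndex, int value);
    int rowCount();
  }

  /** Toy backing: a plain heap array standing in for a value vector. */
  static class HeapBackedWriter implements IntColumnWriter {
    private final int[] values;
    private int rows;

    HeapBackedWriter(int capacity) { values = new int[capacity]; }

    @Override public void setInt(int rowIndex, int value) {
      values[rowIndex] = value;
      rows = Math.max(rows, rowIndex + 1);
    }

    @Override public int rowCount() { return rows; }
  }

  /** "Operator" code: written once, against the interface only. */
  static int fillSequence(IntColumnWriter writer, int n) {
    for (int i = 0; i < n; i++) {
      writer.setInt(i, i * 2);
    }
    return writer.rowCount();
  }

  public static void main(String[] args) {
    IntColumnWriter writer = new HeapBackedWriter(16);
    System.out.println("rows written: " + fillSequence(writer, 10));
  }
}
```

Swapping `HeapBackedWriter` for, say, an Arrow-backed implementation would leave `fillSequence` untouched, which is the whole migration bet.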
Though this is a new idea in Drill, it is not a unique concept in the industry. As it turns out, Presto already does this. Presto is row-based, with rows grouped into batches. (By contrast, Drill is column-based, with columns grouped into batches.) Presto supports multiple row formats: as Java objects and, it seems, as arrays or direct memory. The point is, Presto provides a row access API that hides the memory layout from consumers.

Drill, being vector-based, is far more complex than Presto, however. Vectors have to be pre-allocated, and decisions made in one vector need to be reflected in the others. (That is, if we have two columns, image ID (int) and image data (varbinary), the image data column will get huge fast, but the image ID will stay small.) Column-based memory layout is Drill's big selling point, but it leads to code that is VERY complex compared to the row-based code in Presto. Thus, the Drill vector API ends up being more complicated than Presto's row-based API. As you noted, the column readers and writers (part of EVF) try to handle this complexity.

More to say; will follow up in separate notes. Perhaps we can create a document somewhere to capture these ideas. (Markdown in the Wiki attached to the Drill project is one possibility; easy to track and share.) Then, if the team agrees, we can coordinate the incremental efforts done thus far.

Thanks,
- Paul

[1] https://github.com/paul-rogers/drill/wiki/BH-Future-Work

On Wednesday, January 8, 2020, 10:02:43 AM PST, Igor Guzenko <ihor.huzenko....@gmail.com> wrote:

Hello Paul,

I totally agree that integrating Arrow by simply replacing Vectors usage everywhere would cause a disaster. After a first look at the new Enhanced Vector Framework (EVF), and based on your suggestions, I think I have an idea to share. In my opinion, the integration can be done in two major stages:

*1. Preparation Stage*

1.1 Extract all EVF and related components to a separate module.
So the new separate module will depend only upon the Vectors module.
1.2 Step by step, rewrite all operators to use the higher-level EVF module, and remove the Vectors module from the dependencies of exec and other modules.
1.3 Ensure that the only module which depends on Vectors is the new EVF one.

*2. Integration Stage*

2.1 Add a dependency on the Arrow Vectors module to the EVF module.
2.2 Replace all usages of Drill Vectors & Protobuf meta with Arrow Vectors & Flatbuffers meta in the EVF module.
2.3 Finalize the integration by removing the Drill Vectors module completely.

*NOTE:* I think that, either way, we won't preserve backward compatibility for drivers and custom UDFs, so the proposed changes are a major step forward to be included in the Drill 2.0 version.

Below is a very first list of packages that may in future be moved into the EVF module:

*Module:* exec/Vectors
*Packages:*
org.apache.drill.exec.record.metadata - (An enhanced set of classes to describe a Drill schema.)
org.apache.drill.exec.record.metadata.schema.parser
org.apache.drill.exec.vector.accessor - (JSON-like readers and writers for each kind of Drill vector.)
org.apache.drill.exec.vector.accessor.convert
org.apache.drill.exec.vector.accessor.impl
org.apache.drill.exec.vector.accessor.reader
org.apache.drill.exec.vector.accessor.writer
org.apache.drill.exec.vector.accessor.writer.dummy

*Module:* exec/Java Execution Engine
*Packages:*
org.apache.drill.exec.physical.rowSet - (Record batch management)
org.apache.drill.exec.physical.resultSet - (Enhanced rowSet with memory management)
org.apache.drill.exec.physical.impl.scan - (Row-set-based scan)

Thanks,
Igor Guzenko

On Mon, Dec 9, 2019 at 8:53 PM Paul Rogers <par0...@yahoo.com.invalid> wrote:

> Hi All,
>
> Would be good to do some design brainstorming around this.
>
> Integration with other tools depends on the APIs (the first two items I
> mentioned.)
> Last time I checked (more than a year ago), the memory layout of Arrow
> was close to that in Drill; so conversion is mostly a matter of
> "packaging" and metadata, which can be encapsulated in an API.
>
> Converting the internals is a major undertaking. We have large amounts
> of complex, critical code that works directly with the details of value
> vectors. My thought was to first convert that code to use the column
> readers/writers we've developed. Then, once all internal code uses that
> abstraction, we can replace the underlying vector implementation with
> Arrow. This lets us work in small stages, each of which is deliverable
> by itself.
>
> The other approach is to change all code that works directly with Drill
> vectors to instead work with Arrow. Because that code is so detailed and
> fragile, that is a huge, risky project.
>
> There are other approaches as well. Would be good to explore them before
> we dive into a major project.
>
> Thanks,
> - Paul