Hi Paul,

Currently Gandiva only supports the Project and Filter operations. My work is to integrate the Project operator, since most Filter operations will be pushed down to the Scan anyway.
The Gandiva project interface works at the RecordBatch level: it accepts the memory addresses of the input RecordBatch's vectors. Before invoking it, we also need to construct a binary schema object that describes the input RecordBatch's schema.

The integration work mainly has two parts:
1. At the setup step, find the expressions that Gandiva can evaluate. The matched expressions are handled by Gandiva; the others are still handled by Drill.
2. Invoke the Gandiva native project method. For each matched expression's ValueVector, we allocate a corresponding null-representation ValueVector in Arrow's format, and set the bits of the input null vectors accordingly. The same work is done in reverse for the output ValueVectors: the Arrow output null vector is transferred back to Drill's null vector. Since the native method only cares about physical memory addresses, invoking it is not hard.

Because my current implementation predates DRILL-6340, it does not yet handle the case where the project's output size is less than the input size. Covering that case needs some more work that I have not focused on. To contribute this to the community, there is also a test-case problem to consider, since the Gandiva jar is platform dependent.

On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <[email protected]> wrote:

> Hi Weijie,
>
> Thanks much for the update on your Gandiva work. It is great work.
>
> Can you say more about how you are doing the integration?
>
> As you mentioned, the memory layout of Arrow's null vector differs from the
> "is set" vector in Drill. How did you work around that?
>
> The Project operator is pretty simple if we are just copying or removing
> columns. However, much of Project deals with invoking Drill-provided
> functions: simple ones (add two ints) and complex ones (perform a regex
> match). To be useful, the integration would have to mimic Drill's behavior
> for each of these many functions.
>
> Project currently works row-by-row.
> But, to get the maximum performance,
> it would work column-by-column to take full advantage of vectorization.
> Doing that would require large changes to the code that sets up codegen
> and iterates over the batch.
>
> For operators such as Sort, the only vector-based operations are 1) sort a
> batch using defined keys to get an offset vector, and 2) create a new
> vector by copying values, row-by-row, from one batch to another according
> to the offset vector.
>
> The join and aggregate operations are even more complex, as are the
> partition senders and receivers.
>
> Can you tell us where you've used Gandiva? Which operators? How did you
> handle the function integration? I am very curious how you were able to
> solve these problems.
>
> Thanks,
>
> - Paul
>
> On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> [email protected]> wrote:
>
> Hi:
>
> Gandiva is a sub-project of Arrow. By using LLVM codegen and SIMD
> techniques, Gandiva can achieve better query performance. Arrow and Drill
> have similar columnar memory formats; the main difference now is the null
> representation. Arrow has also made great changes to the ValueVector.
> Adopting Arrow to replace Drill's VV has been discussed before. That would
> be a great job. But by working at the physical memory address level just to
> leverage Gandiva, this work could be relatively small.
>
> I have now done the integration work on our own branch by making some
> changes to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main
> change in ARROW-4819 is to make some package-level methods public. But the
> Arrow community does not seem to plan to accept this change. Their advice
> is to keep our own Arrow branch.
>
> So what do you think?
>
> 1. Keep our own branch of Arrow.
> 2. Wait for the complete Arrow integration.
> Or some other ideas?
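To make the null-representation transfer discussed above concrete: Drill's nullable vectors keep an "is set" vector with one byte per value, while Arrow uses a packed validity bitmap with one bit per value (little-endian within each byte, 1 = valid). Below is a minimal, self-contained sketch of that bit-packing in plain Java. The class and method names (`NullVectorBridge`, `toArrowValidity`, `toDrillBits`) are hypothetical helpers for illustration, not actual Drill or Arrow APIs; the real integration works directly on the vectors' physical memory.

```java
import java.util.Arrays;

// Hypothetical helper illustrating the Drill <-> Arrow null-vector conversion.
// Drill: one byte per value, 1 = set. Arrow: one bit per value, bit (i % 8) of
// byte (i / 8), 1 = valid. Real code would write into the vectors' buffers.
public class NullVectorBridge {

    // Pack Drill's byte-per-value "is set" array into an Arrow validity bitmap.
    public static byte[] toArrowValidity(byte[] drillBits) {
        byte[] validity = new byte[(drillBits.length + 7) / 8];
        for (int i = 0; i < drillBits.length; i++) {
            if (drillBits[i] != 0) {
                validity[i / 8] |= (byte) (1 << (i % 8));
            }
        }
        return validity;
    }

    // Unpack an Arrow validity bitmap back into Drill's byte-per-value form.
    public static byte[] toDrillBits(byte[] validity, int valueCount) {
        byte[] drillBits = new byte[valueCount];
        for (int i = 0; i < valueCount; i++) {
            drillBits[i] = (byte) ((validity[i / 8] >> (i % 8)) & 1);
        }
        return drillBits;
    }

    public static void main(String[] args) {
        byte[] drill = {1, 0, 1, 1, 0, 0, 1, 0, 1};
        byte[] arrow = toArrowValidity(drill);
        System.out.println(Arrays.toString(arrow));
        System.out.println(Arrays.toString(toDrillBits(arrow, drill.length)));
    }
}
```

The round trip is lossless, which is what lets the integration convert input vectors before the native call and convert the output vectors back afterwards.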
