Gandiva 's Project does not allocate any more memory to execute. It just calculates the input memory data whatever they are var-length or fixed-width. The output memory will also be allocated by the Drill ahead which needs to be fixed-width vectors. The var-width output vector cases should not be allowed the Gandiva to evaluate since that will need Gandiva to allocate additional memory which is not controlled by the JVM.
I guess that's why Gandiva does not implement operator like HashJoin or HashAggregate which need to allocate additional memory to implement. But Arrow's WIP PR ARROW-3191 https://github.com/apache/arrow/pull/4151 will make that possible. On Tue, Apr 23, 2019 at 7:15 AM Parth Chandra <[email protected]> wrote: > Is there a way to provide Drill's memory allocator to Gandiva/Arrow? If > not, then how do we keep a proper accounting of any memory used by > Gandiva/Arrow? > > On Sat, Apr 20, 2019 at 7:05 PM Paul Rogers <[email protected]> > wrote: > > > Hi Weijie, > > > > Thanks much for the explanation. Sounds like you are making good > progress. > > > > > > For which operator is the filter pushed into the scan? Although Impala > > does this for all scans, AFAIK, Drill does not do so. For example, the > text > > and JSON reader do not handle filtering. Filtering is instead done by the > > Filter operator in these cases. Perhaps you have your own special scan > > which handles filtering? > > > > > > The concern in DRILL-6340 was the user might do a project operation that > > causes the output batch to be much larger than the input batch. Someone > > suggested flatten as one example. String concatenation is another > example. > > The input batch might be large. The result of the concatenation could be > > too large for available memory. So, the idea was to project the single > > input batch into two (or more) output batches to control batch size. > > > > > > II like how you've categorized the vectors into the set that Gandiva can > > project, and the set that Drill must handle. Maybe you can extend this > idea > > for the case where input batches are split into multiple output batches. > > > > Let Drill handle VarChar expressions that could increase column width > > (such as the concatenate operator.) Let Drill decide the number of rows > in > > the output batch. Then, for the columns that Gandiva can handle, project > > just those rows needed for the current output batch. > > > > Your solution might also be extended to handle the Gandiva library issue. > > Since you are splitting vectors into the Drill group and the Gandiva > group, > > if Drill runs on a platform without Gandiva support, or if the Gandiva > > library can't be found, just let all vectors fall into the Drill vector > > group. > > > > If the user wants to use Gandiva, he/she could set a config option to > > point to the Gandiva library (and supporting files, if any.) Or, use the > > existing LD_LIBRARY_PATH env. variable. > > > > Thanks, > > - Paul > > > > > > > > On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong < > > [email protected]> wrote: > > > > Hi Paul: > > Currently Gandiva only supports Project ,Filter operations. My work is to > > integrate Project operator. Since most of the Filter operator will be > > pushed down to the Scan. > > > > The Gandiva project interface works at the RecordBatch level. It accepts > > the memory address of the vectors of input RecordBatch and . Before that > > it also need to construct a binary schema object to describe the input > > RecordBatch schema. > > > > The integration work mainly has two parts: > > 1. at the setup step, find the expressions which can be solved by the > > Gandiva . The matched expression will be solved by the Gandiva, others > will > > still be solved by Drill. > > 2. invoking the Gandiva native project method. The matched expressions' > > ValueVectors will all be allocated corresponding Arrow type null > > representation ValueVector. The null input vector's bit will also be > set. > > The same work will also be done to the output ValueVectors, transfer the > > arrow output null vector to Drill's null vector. Since the native method > > only care the physical memory address, invoking that native method is > not a > > hard work. > > > > Since my current implementation is before DRILL-6340, it does not solve > the > > output size of the project which is less than the input size case. To > cover > > that case , there's some more work to do which I have not focused on. > > > > To contribute to community , there's also some test case problem which > > needs to be considered, since the Gandiva jar is platform dependent. > > > > > > > > > > On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <[email protected]> > > wrote: > > > > > Hi Weijie, > > > > > > Thanks much for the update on your Gandiva work. It is great work. > > > > > > Can you say more about how you are doing the integration? > > > > > > As you mentioned the memory layout of Arrow's null vector differs from > > the > > > "is set" vector in Drill. How did you work around that? > > > > > > The Project operator is pretty simple if we are just copying or > removing > > > columns. However, much of Project deals with invoking Drill-provided > > > functions: simple ones (add two ints) and complex ones (perform a regex > > > match). To be useful, the integration would have to mimic Drill's > > behavior > > > for each of these many functions. > > > > > > Project currently works row-by-row. But, to get the maximum > performance, > > > it would work column-by-column to take full advantage of vectorization. > > > Doing that would require large changes to the code that sets up > codegen, > > > and iterates over the batch. > > > > > > > > > For operators such as Sort, the only vector-based operations are 1) > sort > > a > > > batch using defined keys to get an offset vector, and 2) create a new > > > vector by copying values, row-by-row, from one batch to another > according > > > to the offset vector. > > > > > > The join and aggregate operations are even more complex, as are the > > > partition senders and receivers. > > > > > > Can you tell us where you've used Gandiva? Which operators? How did you > > > handle the function integration? I am very curious how you were able to > > > solve these problems. > > > > > > > > > Thanks, > > > > > > - Paul > > > > > > > > > > > > On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong < > > > [email protected]> wrote: > > > > > > HI : > > > > > > Gandiva is a sub project of Arrow. Arrow gandiva using LLVM codegen and > > > simd skill could achieve better query performance. Arrow and Drill has > > > similar column memory format. The main difference now is the null > > > representation. Also Arrow has made great changes to the ValueVector. > To > > > adopt Arrow to replace Drill's VV has been discussed before. That would > > be > > > a great job. But to leverage gandiva , by working at the physical > memory > > > address level , this work could be little relatively. > > > > > > Now I have done the integration work at our own branch by make some > > changes > > > to the Arrow branch, and issued DRILL-7087 and ARROW-4819. The main > > changes > > > to ARROW-4819 is to make some package level method to be public. But > > arrow > > > community seems not plan to accept this change. Their advice is to > have a > > > arrow branch. > > > > > > So what do you think? > > > > > > 1、Have a self branch of Arrow. > > > 2、waiting for the Arrow integration completely. > > > or some other ideas? >
