Is there a way to provide Drill's memory allocator to Gandiva/Arrow? If not, then how do we keep a proper accounting of any memory used by Gandiva/Arrow?
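One possibility, if Gandiva/Arrow cannot be handed Drill's allocator directly, is to register every native buffer Gandiva returns against a Drill-side budget. The sketch below is only an illustration of that accounting pattern; the class name and callbacks are hypothetical, not Drill's or Arrow's actual API.

```java
import java.util.concurrent.atomic.AtomicLong;

/**
 * Hypothetical accounting shim: even if Gandiva cannot use Drill's
 * allocator, each buffer Gandiva hands back could be registered here
 * so the operator's memory budget stays accurate.
 */
public class GandivaMemoryTracker {
    private final long limit;                      // operator memory budget, bytes
    private final AtomicLong allocated = new AtomicLong();

    public GandivaMemoryTracker(long limit) {
        this.limit = limit;
    }

    /** Record a buffer handed back by Gandiva; fail fast if over budget. */
    public void onAllocate(long bytes) {
        long now = allocated.addAndGet(bytes);
        if (now > limit) {
            allocated.addAndGet(-bytes);           // roll back the failed reservation
            throw new IllegalStateException(
                "Gandiva allocation of " + bytes + " bytes exceeds budget of " + limit);
        }
    }

    /** Record that a Gandiva-owned buffer was released. */
    public void onRelease(long bytes) {
        allocated.addAndGet(-bytes);
    }

    public long getAllocated() {
        return allocated.get();
    }
}
```

Whether Gandiva exposes hooks for such callbacks is exactly the open question above; without them, the operator would have to account for buffer sizes at the points where it receives and frees Gandiva output.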
On Sat, Apr 20, 2019 at 7:05 PM Paul Rogers <[email protected]> wrote:

> Hi Weijie,
>
> Thanks much for the explanation. Sounds like you are making good progress.
>
> For which operator is the filter pushed into the scan? Although Impala
> does this for all scans, AFAIK, Drill does not. For example, the text
> and JSON readers do not handle filtering; filtering is instead done by the
> Filter operator in these cases. Perhaps you have your own special scan
> which handles filtering?
>
> The concern in DRILL-6340 was that the user might do a project operation
> that causes the output batch to be much larger than the input batch.
> Someone suggested flatten as one example. String concatenation is another:
> the input batch might be large, and the result of the concatenation could
> be too large for available memory. So, the idea was to project the single
> input batch into two (or more) output batches to control batch size.
>
> I like how you've categorized the vectors into the set that Gandiva can
> project and the set that Drill must handle. Maybe you can extend this idea
> to the case where input batches are split into multiple output batches.
>
> Let Drill handle VarChar expressions that could increase column width
> (such as the concatenate operator). Let Drill decide the number of rows in
> the output batch. Then, for the columns that Gandiva can handle, project
> just those rows needed for the current output batch.
>
> Your solution might also be extended to handle the Gandiva library issue.
> Since you are splitting vectors into the Drill group and the Gandiva
> group, if Drill runs on a platform without Gandiva support, or if the
> Gandiva library can't be found, just let all vectors fall into the Drill
> vector group.
>
> If the user wants to use Gandiva, he/she could set a config option to
> point to the Gandiva library (and supporting files, if any). Or, use the
> existing LD_LIBRARY_PATH env. variable.
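The fallback Paul describes, where everything drops into the Drill group when the native library is missing, might look roughly like this. The class name and the supported-function set are illustrative only, not Drill's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

/** Illustrative splitter: partitions project expressions into the set
 *  Gandiva can evaluate and the set Drill must evaluate itself. */
public class ProjectionSplitter {
    // Hypothetical stand-in for the functions the local Gandiva build supports.
    private final Set<String> gandivaFunctions;
    private final boolean gandivaAvailable;    // false if the native library failed to load

    public ProjectionSplitter(Set<String> gandivaFunctions, boolean gandivaAvailable) {
        this.gandivaFunctions = gandivaFunctions;
        this.gandivaAvailable = gandivaAvailable;
    }

    /** Returns [gandivaExprs, drillExprs]. With no native library,
     *  every expression falls back to the Drill group. */
    public List<List<String>> split(List<String> functionNames) {
        List<String> gandiva = new ArrayList<>();
        List<String> drill = new ArrayList<>();
        for (String fn : functionNames) {
            if (gandivaAvailable && gandivaFunctions.contains(fn)) {
                gandiva.add(fn);
            } else {
                drill.add(fn);
            }
        }
        List<List<String>> out = new ArrayList<>();
        out.add(gandiva);
        out.add(drill);
        return out;
    }
}
```

The key property is that the Drill path is always complete on its own, so a missing or unsupported Gandiva library degrades to plain Drill execution rather than failing the query.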
>
> Thanks,
> - Paul
>
>
> On Thursday, April 18, 2019, 11:45:08 PM PDT, weijie tong <
> [email protected]> wrote:
>
> Hi Paul:
> Currently Gandiva only supports Project and Filter operations. My work is
> to integrate the Project operator, since most Filter operations will be
> pushed down to the Scan.
>
> The Gandiva project interface works at the RecordBatch level. It accepts
> the memory addresses of the vectors of the input RecordBatch. Before that,
> it also needs to construct a binary schema object describing the input
> RecordBatch schema.
>
> The integration work mainly has two parts:
> 1. At the setup step, find the expressions which can be handled by
> Gandiva. The matched expressions will be evaluated by Gandiva; the others
> will still be evaluated by Drill.
> 2. Invoke the Gandiva native project method. Each matched expression's
> ValueVector will be allocated a corresponding Arrow-style null
> representation ValueVector, and the null input vector's bits will be set.
> The same work is done for the output ValueVectors: transfer the Arrow
> output null vector to Drill's null vector. Since the native method only
> cares about physical memory addresses, invoking it is not hard.
>
> Since my current implementation predates DRILL-6340, it does not handle
> the case where the output size of the project is less than the input size.
> To cover that case, there is more work to do which I have not yet focused
> on.
>
> To contribute this to the community, there is also a test-case problem to
> consider, since the Gandiva jar is platform dependent.
>
>
> On Fri, Apr 19, 2019 at 8:43 AM Paul Rogers <[email protected]>
> wrote:
>
> > Hi Weijie,
> >
> > Thanks much for the update on your Gandiva work. It is great work.
> >
> > Can you say more about how you are doing the integration?
> >
> > As you mentioned, the memory layout of Arrow's null vector differs from
> > the "is set" vector in Drill. How did you work around that?
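For readers following the null-representation point: Drill's nullable vectors carry a one-byte-per-value "is set" vector, while Arrow packs validity into one bit per value, least-significant bit first. The conversion weijie describes reduces to repacking those flags. A minimal sketch on plain byte arrays (the real implementation works on direct memory buffers, not Java arrays):

```java
/** Sketch of the null-representation bridge: Drill stores one "is set"
 *  byte per value; Arrow packs one validity bit per value, LSB first. */
public class NullBridge {

    /** Drill-style byte-per-value flags -> Arrow-style packed bitmap. */
    public static byte[] toArrowValidity(byte[] drillIsSet) {
        byte[] bitmap = new byte[(drillIsSet.length + 7) / 8];
        for (int i = 0; i < drillIsSet.length; i++) {
            if (drillIsSet[i] != 0) {
                bitmap[i / 8] |= (1 << (i % 8));   // set the i-th validity bit
            }
        }
        return bitmap;
    }

    /** Arrow-style packed bitmap -> Drill-style byte-per-value flags. */
    public static byte[] toDrillIsSet(byte[] bitmap, int valueCount) {
        byte[] isSet = new byte[valueCount];
        for (int i = 0; i < valueCount; i++) {
            isSet[i] = (byte) ((bitmap[i / 8] >> (i % 8)) & 1);
        }
        return isSet;
    }
}
```

This repacking is the per-batch cost of bridging the two formats until (or unless) Drill adopts Arrow's layout outright.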
> >
> > The Project operator is pretty simple if we are just copying or removing
> > columns. However, much of Project deals with invoking Drill-provided
> > functions: simple ones (add two ints) and complex ones (perform a regex
> > match). To be useful, the integration would have to mimic Drill's
> > behavior for each of these many functions.
> >
> > Project currently works row-by-row. But, to get the maximum performance,
> > it would work column-by-column to take full advantage of vectorization.
> > Doing that would require large changes to the code that sets up codegen
> > and iterates over the batch.
> >
> > For operators such as Sort, the only vector-based operations are 1) sort
> > a batch using defined keys to get an offset vector, and 2) create a new
> > vector by copying values, row-by-row, from one batch to another
> > according to the offset vector.
> >
> > The join and aggregate operations are even more complex, as are the
> > partition senders and receivers.
> >
> > Can you tell us where you've used Gandiva? Which operators? How did you
> > handle the function integration? I am very curious how you were able to
> > solve these problems.
> >
> > Thanks,
> >
> > - Paul
> >
> >
> > On Wednesday, April 3, 2019, 11:51:34 PM PDT, weijie tong <
> > [email protected]> wrote:
> >
> > Hi:
> >
> > Gandiva is a subproject of Arrow. Gandiva uses LLVM codegen and SIMD to
> > achieve better query performance. Arrow and Drill have similar columnar
> > memory formats. The main difference now is the null representation.
> > Also, Arrow has made great changes to the ValueVector. Adopting Arrow to
> > replace Drill's VV has been discussed before. That would be a great job.
> > But to leverage Gandiva by working at the physical memory address level,
> > this work could be relatively small.
> >
> > Now I have done the integration work in our own branch by making some
> > changes to the Arrow branch, and have filed DRILL-7087 and ARROW-4819.
> > The main change in ARROW-4819 is to make some package-level methods
> > public, but the Arrow community does not seem to plan to accept it.
> > Their advice is to maintain a separate Arrow branch.
> >
> > So what do you think?
> >
> > 1. Maintain our own branch of Arrow.
> > 2. Wait for the complete Arrow integration.
> > Or some other ideas?
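The two Sort primitives Paul lists above, building an offset vector by sorting keys and then copying values through it, reduce to an argsort plus a gather. A minimal sketch on plain int arrays (Drill's real selection vectors are SV2/SV4 over direct memory; this only shows the idea):

```java
import java.util.Arrays;
import java.util.Comparator;

/** Sketch of Sort's two vector operations: build an offset (selection)
 *  vector by sorting keys, then gather values through it. */
public class SortSketch {

    /** Step 1: sort key positions, not the data, yielding an offset vector. */
    public static int[] sortToOffsets(int[] keys) {
        Integer[] idx = new Integer[keys.length];
        for (int i = 0; i < keys.length; i++) idx[i] = i;
        Arrays.sort(idx, Comparator.comparingInt(i -> keys[i]));
        int[] offsets = new int[keys.length];
        for (int i = 0; i < keys.length; i++) offsets[i] = idx[i];
        return offsets;
    }

    /** Step 2: copy values row-by-row through the offset vector. */
    public static int[] gather(int[] values, int[] offsets) {
        int[] out = new int[offsets.length];
        for (int i = 0; i < offsets.length; i++) {
            out[i] = values[offsets[i]];
        }
        return out;
    }
}
```

Separating the sort of positions from the copy of values is what lets the data vectors stay untouched until the final gather, which is also why the gather step, not the sort itself, is the part that resembles a Gandiva-style projection.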
