Re: Min/Max runtime filtering on Impala-Kudu

Matthew Jacobs Mon, 27 Mar 2017 14:30:11 -0700

Thanks for writing this up, Sailesh. It sounds reasonable.


On Mon, Mar 27, 2017 at 2:24 PM, Sailesh Mukil <[email protected]> wrote:
> On Mon, Mar 27, 2017 at 11:49 AM, Marcel Kornacker <[email protected]>
> wrote:
>
>> On Mon, Mar 27, 2017 at 11:42 AM, Sailesh Mukil <[email protected]>
>> wrote:
>> > I will be working on a patch to add min/max filter support in Impala, and
>> > as a first step, specifically target the KuduScanNode, since the Kudu
>> > client is already able to accept a Min and a Max that it would internally
>> > use to filter during its scans. Below is a brief design proposal.
>> >
>> > *Goal:*
>> >
>> > To leverage runtime min/max filter support in Kudu to potentially speed
>> > up queries over Kudu tables. Kudu does this by taking a min and a max
>> > that Impala provides and returning only values in the range Impala is
>> > interested in.
>> >
>> > *[min <= range we're interested in <= max]*
>> >
>> > *Proposal:*
>> >
>> >
>> >    - As a first step, plumb the runtime filter code from
>> >    *exec/hdfs-scan-node-base.cc/h* to *exec/scan-node.cc/h*, so that it
>> >    can be applied to *KuduScanNode* cleanly as well, since *KuduScanNode*
>> >    and *HdfsScanNodeBase* both inherit from *ScanNode.*
>>
>> Quick comment: please make sure your solution also applies to
>> KuduScanNodeMt.
>>
>
> Thanks for the input, I'll make sure to do that.
>
>
>>
>> >    - Reuse the *ColumnStats* class (exec/parquet-column-stats.h) or
>> >    implement a lighter weight version of it to process and store the Min
>> >    and the Max on the build side of the join.
>> >    - Once the Min and Max values are added to the existing runtime filter
>> >    structures, as a first step, we will ignore the Min and Max values for
>> >    non-Kudu tables. Using them for non-Kudu tables can come in
>> >    follow-up patch(es).
>> >    - Similarly, the bloom filter will be ignored for Kudu tables, and
>> >    only the Min and Max values will be used, since Kudu does not accept
>> >    bloom filters yet. (https://issues.apache.org/jira/browse/IMPALA-3741)
>> >    - Applying the bloom filter on the Impala side of the Kudu scan (i.e.
>> >    in KuduScanNode) is not in the scope of this patch.
>> >
>> >
>> > *Complications:*
>> >
>> >    - We have to make sure that finding the Min and Max values on the
>> >    build side doesn't regress certain workloads. The difference between
>> >    generating a bloom filter and generating a Min and a Max is that a
>> >    bloom filter can be type agnostic (we just take a raw hash over the
>> >    data), whereas a Min and a Max have to be type specific.
>>
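
To make the type-specific point in the last bullet concrete, here is a minimal sketch of build-side min/max tracking in C++. It is only an illustration under stated assumptions: MinMaxFilter, Insert, MayContain and BuildFilter are hypothetical names, not existing Impala or Kudu APIs, and the sketch assumes the key type is default-constructible and ordered by operator<. That per-type comparison is exactly the dependency a raw hash over the bytes avoids.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical min/max filter. The name and interface are illustrative
// only and are not taken from the Impala code base.
template <typename T>
class MinMaxFilter {
 public:
  // Build side: widen the range so that it covers one more join-key value.
  void Insert(const T& v) {
    if (empty_) {
      min_ = v;
      max_ = v;
      empty_ = false;
      return;
    }
    if (v < min_) min_ = v;
    if (max_ < v) max_ = v;
  }

  // True if no build-side value has been inserted yet.
  bool Empty() const { return empty_; }

  // Probe side: a value can only match if min <= v <= max.
  // An empty filter rejects every value.
  bool MayContain(const T& v) const {
    return !empty_ && !(v < min_) && !(max_ < v);
  }

  // Only meaningful when Empty() is false.
  const T& min() const { return min_; }
  const T& max() const { return max_; }

 private:
  bool empty_ = true;
  T min_{};
  T max_{};
};

// Builds a filter from a batch of build-side join keys.
template <typename T>
MinMaxFilter<T> BuildFilter(const std::vector<T>& keys) {
  MinMaxFilter<T> filter;
  for (const T& k : keys) filter.Insert(k);
  return filter;
}
```

With int64_t keys {5, -2, 9} the resulting range is [-2, 9], so a probe value of 10 is rejected while 0 passes. Instantiating the same template with std::string compares lexicographically, which shows why each column type needs its own comparison path.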
