Thanks for writing this up, Sailesh. It sounds reasonable.
On Mon, Mar 27, 2017 at 2:24 PM, Sailesh Mukil <[email protected]> wrote:

> On Mon, Mar 27, 2017 at 11:49 AM, Marcel Kornacker <[email protected]>
> wrote:
>
>> On Mon, Mar 27, 2017 at 11:42 AM, Sailesh Mukil <[email protected]>
>> wrote:
>> > I will be working on a patch to add min/max filter support in Impala,
>> > and as a first step, specifically target the KuduScanNode, since the
>> > Kudu client is already able to accept a Min and a Max that it would
>> > internally use to filter during its scans. Below is a brief design
>> > proposal.
>> >
>> > *Goal:*
>> >
>> > To leverage runtime min/max filter support in Kudu for the potential
>> > speed up of queries over Kudu tables. Kudu does this by taking a min
>> > and a max that Impala will provide and only returning values in the
>> > range Impala is interested in.
>> >
>> > *[min <= range we're interested in <= max]*
>> >
>> > *Proposal:*
>> >
>> > - As a first step, plumb the runtime filter code from
>> >   *exec/hdfs-scan-node-base.cc/h* to *exec/scan-node.cc/h*, so that
>> >   it can be applied to *KuduScanNode* cleanly as well, since
>> >   *KuduScanNode* and *HdfsScanNodeBase* both inherit from *ScanNode*.
>>
>> Quick comment: please make sure your solution also applies to
>> KuduScanNodeMt.
>
> Thanks for the input, I'll make sure to do that.
>
>> > - Reuse the *ColumnStats* class (exec/parquet-column-stats.h) or
>> >   implement a lighter weight version of it to process and store the
>> >   Min and the Max on the build side of the join.
>> > - Once the Min and Max values are added to the existing runtime
>> >   filter structures, as a first step, we will ignore the Min and Max
>> >   values for non-Kudu tables. Using them for non-Kudu tables can come
>> >   in a following patch (or patches).
>> > - Similarly, the bloom filter will be ignored for Kudu tables, and
>> >   only the Min and Max values will be used, since Kudu does not
>> >   accept bloom filters yet.
>> >   (https://issues.apache.org/jira/browse/IMPALA-3741)
>> > - Applying the bloom filter on the Impala side of the Kudu scan (i.e.
>> >   in KuduScanNode) is not in the scope of this patch.
>> >
>> > *Complications:*
>> >
>> > - We have to make sure that finding the Min and Max values on the
>> >   build side doesn't regress certain workloads, since the difference
>> >   between generating a bloom filter and generating a Min and a Max
>> >   is that a bloom filter can be type agnostic (we just take a raw
>> >   hash over the data) whereas a Min and a Max have to be type
>> >   specific.
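
To make the "type specific" point under Complications concrete, here is a rough sketch of what tracking a Min and a Max on the build side could look like. Everything in it is hypothetical: `MinMaxFilter`, `BuildFilter`, and `MayContain` are illustrative names, not the existing *ColumnStats* class or any runtime filter structure in Impala, and it assumes only that the column type supports `operator<`.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical typed min/max filter (illustrative only, not Impala code).
// Unlike a bloom filter, which can hash raw bytes, this needs a comparison
// that is meaningful for the column type T, hence the template parameter.
template <typename T>
class MinMaxFilter {
 public:
  // Called once per build-side value; widens [min, max] as needed.
  void Insert(const T& v) {
    if (!has_values_) {
      min_ = v;
      max_ = v;
      has_values_ = true;
      return;
    }
    if (v < min_) min_ = v;
    if (max_ < v) max_ = v;
  }

  bool empty() const { return !has_values_; }
  const T& min() const { return min_; }  // only meaningful if !empty()
  const T& max() const { return max_; }  // only meaningful if !empty()

  // True if v lies in [min, max]. An empty filter rejects every value,
  // which assumes that an empty build side can match no probe rows.
  bool MayContain(const T& v) const {
    return has_values_ && !(v < min_) && !(max_ < v);
  }

 private:
  bool has_values_ = false;
  T min_{};
  T max_{};
};

// Builds a filter from all values of one build-side join column.
template <typename T>
MinMaxFilter<T> BuildFilter(const std::vector<T>& build_values) {
  MinMaxFilter<T> filter;
  for (const T& v : build_values) filter.Insert(v);
  return filter;
}
```

The per-row cost is two comparisons in the common case, but a separate instantiation is needed for each column type (for example `MinMaxFilter<int64_t>` and `MinMaxFilter<std::string>`), which is where the regression risk relative to a single raw-hash code path would need to be measured. The resulting `min()` and `max()` are the two values that would then be handed to the Kudu scan.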
