[
https://issues.apache.org/jira/browse/IMPALA-4252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Thomas Tauber-Marshall resolved IMPALA-4252.
--------------------------------------------
Resolution: Fixed
Fix Version/s: Impala 2.11.0
commit 2510fe0aa0c86f460af9040eb413aad76c13cc84
Author: Thomas Tauber-Marshall
Date: Mon Oct 23 07:58:34 2017 -0700
IMPALA-4252: Min-max runtime filters for Kudu
This patch implements min-max filters for runtime filters. Each
runtime filter generates a bloom filter or a min-max filter,
depending on whether it has HDFS or Kudu targets, respectively.
In RuntimeFilterGenerator in the planner, each hash join node
generates a bloom and min-max filter for each equi-join predicate, but
only those filters that end up being assigned to a target make it into
the final plan.
Min-max filters are only assigned to Kudu scans if the target expr is
a column, as Kudu doesn't support bounds on general exprs, and only if
the join op is '=' and not 'is distinct from', as Kudu doesn't support
returning NULLs if a bound is set.
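As a rough illustration of the assignment rule above, the check can be sketched as a single predicate. The FilterTarget type, its field names, and the "kudu"/"hdfs" labels are hypothetical and are not taken from the Impala planner; only the two conditions (target expr is a column, join op is '=') come from the description above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilterTarget:
    """Hypothetical description of a candidate runtime filter target."""
    storage: str          # "kudu" or "hdfs" (illustrative labels only)
    expr_is_column: bool  # True when the target expr is a bare column reference
    join_op: str          # the join comparison operator, e.g. "="


def can_assign_min_max(target: FilterTarget) -> bool:
    """Return True when a min-max filter may be assigned to this target."""
    return (
        target.storage == "kudu"
        and target.expr_is_column    # Kudu cannot bound general exprs
        and target.join_op == "="    # a bound would drop NULLs a NULL-matching op needs
    )
```

Under these assumptions, a target such as FilterTarget("kudu", False, "=") is rejected because its expr is not a plain column.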
Values are inserted into min-max filters by the PartitionedHashJoinBuilder.
Codegen is used to eliminate branching on the type of filter. String
min-max filters truncate their bounds at 1024 chars, so that the max
amount of memory used by min-max filters is negligible.
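The idea of a string min-max filter with truncated bounds can be sketched as follows. This is a minimal illustration, not the code from the patch: the class and helper names are hypothetical, the 1024 limit is applied to bytes here, and the handling of the truncated upper bound (bumping the last byte so the bound stays valid, and falling back to "always true" when that is impossible) is an assumption about one reasonable way to keep the filter conservative.

```python
from dataclasses import dataclass
from typing import Optional

MAX_BOUND_LEN = 1024  # truncation length mentioned above (bytes in this sketch)


def _truncate_upper(value: bytes) -> Optional[bytes]:
    """Return a bound >= value of at most MAX_BOUND_LEN bytes, or None."""
    if len(value) <= MAX_BOUND_LEN:
        return value
    prefix = bytearray(value[:MAX_BOUND_LEN])
    # Increment the last byte that is not 0xFF and cut the prefix there,
    # so the result compares greater than the untruncated value.
    for i in reversed(range(len(prefix))):
        if prefix[i] < 0xFF:
            prefix[i] += 1
            return bytes(prefix[: i + 1])
    return None  # every byte is 0xFF: no finite upper bound fits


@dataclass
class StringMinMaxFilter:
    """Hypothetical sketch of a min-max filter over byte strings."""
    min_val: Optional[bytes] = None
    max_val: Optional[bytes] = None
    always_true: bool = False  # set when the upper bound cannot be represented

    def insert(self, value: Optional[bytes]) -> None:
        if value is None or self.always_true:
            return  # assumption: NULL build values are skipped
        lo = value[:MAX_BOUND_LEN]  # a prefix never compares greater than the value
        hi = _truncate_upper(value)
        if hi is None:
            self.always_true = True
            return
        if self.min_val is None or lo < self.min_val:
            self.min_val = lo
        if self.max_val is None or hi > self.max_val:
            self.max_val = hi

    def may_contain(self, value: bytes) -> bool:
        if self.always_true:
            return True
        if self.min_val is None or self.max_val is None:
            return False  # nothing was inserted, so nothing can match
        return self.min_val <= value <= self.max_val
```

Because each bound is at most MAX_BOUND_LEN bytes, the memory per filter stays small regardless of how long the inserted strings are, at the cost of a slightly looser range.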
For now, min-max filters are only applied at the KuduScanner, which
passes them into the Kudu client.
Future work will address applying min-max filters at HDFS scan nodes
and applying bloom filters at Kudu scan nodes.
Functional Testing:
- Added new planner tests and updated the old ones. (In the old tests,
many runtime filters are renumbered, because min-max filters are
always generated even if they are never assigned, and they consume
some of the RF ids.)
- Updated existing runtime filter tests to work with Kudu.
- Added e2e tests for min-max filter specific functionality.
Perf Testing:
- All tests were run on the Kudu stress cluster (10 nodes) with
tpch_100_kudu; timings are averages of 3 runs.
- Ran a contrived query with a filter that does not eliminate any rows
(full self join of lineitem). The difference in running time was
negligible - 24.46s with filters on, 24.15s with filters off for
a ~1% slowdown.
- Ran a contrived query with a filter that eliminates all rows (self
join on lineitem with a join condition that never matches). The
filters resulted in a significant speedup - 0.26s with filters on,
1.46s with filters off for a ~5.6x speedup. This query is added to
targeted-perf.
Change-Id: I02bad890f5b5f78388a3041bf38f89369b5e2f1c
Reviewed-on: http://gerrit.cloudera.org:8080/7793
Reviewed-by: Thomas Tauber-Marshall
Tested-by: Impala Public Jenkins
> Add RuntimeFilters for "in list" and/or min/max at KuduScanNode
> ---------------------------------------------------------------
>
> Key: IMPALA-4252
> URL: https://issues.apache.org/jira/browse/IMPALA-4252
> Project: IMPALA
> Issue Type: New Feature
> Components: Backend
> Affects Versions: Kudu_Impala
> Reporter: Matthew Jacobs
> Assignee: Thomas Tauber-Marshall
> Labels: kudu, performance, runtime-filters
> Fix For: Impala 2.11.0
>
>
> Kudu scans will benefit significantly from runtime filters. For now, we could
> generate an 'in list' and/or min/max values instead of bloom filters to push
> to the KuduScanNode and, with support from the Kudu client
> ([KUDU-1683|https://issues.apache.org/jira/browse/KUDU-1683]), we could push
> them to Kudu during execution. At some point, it would be nice to push bloom
> filters to Kudu (IMPALA-3741), but that will require more work and should be
> a potential follow-up task to this.