[ https://issues.apache.org/jira/browse/IMPALA-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351550#comment-17351550 ]
ASF subversion and git services commented on IMPALA-10650:
----------------------------------------------------------
Commit b50d60a6c5b6fdd182dfc851841edae5cd1b3943 in impala's branch
refs/heads/master from Qifan Chen
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b50d60a ]
IMPALA-10650: Bail out min/max filters in hash join builder early
This change set addresses the weakness in populating min/max filters
in the hash join builder by periodically measuring the usefulness of
each filter and setting the 'always_true_' flag accordingly. Once the
flag is set to true, insertion into such a filter skips every step
from evaluating the value from a row to checking the value against
the min/max range. This optimization is LLVM-enabled.
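The periodic usefulness check described above can be sketched as follows. This is a minimal illustration, not the code from the commit: the class name SketchMinMaxFilter, the CheckUsefulness() method, the threshold parameter, and the use of the column's min/max stats to judge usefulness are all assumptions made for the example. Only the 'always_true_' flag name comes from the commit message.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical, simplified filter for illustration; not Impala's class.
class SketchMinMaxFilter {
 public:
  // col_min/col_max: column min/max stats (assumed to be known up front).
  // threshold: fraction of the column range at or above which the filter
  // is treated as not useful (an assumed tuning knob).
  SketchMinMaxFilter(int64_t col_min, int64_t col_max, double threshold)
    : col_min_(col_min), col_max_(col_max), threshold_(threshold) {
    assert(col_min <= col_max);
  }

  // Once always_true_ is set, insertion does no work at all.
  void Insert(int64_t v) {
    if (always_true_) return;
    min_ = std::min(min_, v);
    max_ = std::max(max_, v);
  }

  // Meant to be called periodically by the builder, e.g. after a batch.
  void CheckUsefulness() {
    if (always_true_ || min_ > max_) return;  // already bailed out, or empty
    // Compute the ranges in double to avoid int64_t overflow.
    const double covered =
        static_cast<double>(max_) - static_cast<double>(min_);
    const double total =
        static_cast<double>(col_max_) - static_cast<double>(col_min_);
    if (total <= 0.0 || covered / total >= threshold_) always_true_ = true;
  }

  bool AlwaysTrue() const { return always_true_; }
  int64_t Min() const { return min_; }
  int64_t Max() const { return max_; }

 private:
  const int64_t col_min_;
  const int64_t col_max_;
  const double threshold_;
  int64_t min_ = std::numeric_limits<int64_t>::max();
  int64_t max_ = std::numeric_limits<int64_t>::min();
  bool always_true_ = false;
};
```

In this sketch, a filter whose [min, max] range already covers most of the column's range cannot prune much on the probe side, so later inserts return immediately.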
In addition, a new flag 'is_min_max_value_present' is added to
TRuntimeFilterTargetDesc to indicate whether the min/max column stats
are present in the query plan. The flag eliminates the need to check
for the presence of min/max stats for every row in the back end.
Early bail-out improves the hash join builder step in general. For
example, the step for join node #11 in TPCDS Q8 improves by 13%, and
the step for join node #8 in TPCDS Q16 improves by 3.2%.
The Insert() methods are optimized with branch prediction compiler
hints, which yield the following improvements when tested by
inserting 10000 randomly generated items:
Small Integers: 7.0%
Integers: 4.1%
Big Integers: 4.3%
Strings: 5.6%
Dates: 4.4%
Timestamps: 10.7%
Decimals(4): 10.4%
Decimals(8): 9.1%
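As a rough illustration of what branch prediction hints in an Insert() method can look like, the sketch below uses the GCC/Clang builtin __builtin_expect. The macro names, the struct, and the choice of which branches are marked unlikely are assumptions for the example; the commit message does not say which branches were annotated.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical hint macros; they fall back to no-ops on other compilers.
#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_LIKELY(x) __builtin_expect(!!(x), 1)
#define SKETCH_UNLIKELY(x) __builtin_expect(!!(x), 0)
#else
#define SKETCH_LIKELY(x) (x)
#define SKETCH_UNLIKELY(x) (x)
#endif

// Hypothetical, simplified integer min/max filter.
struct SketchIntFilter {
  int32_t min = std::numeric_limits<int32_t>::max();
  int32_t max = std::numeric_limits<int32_t>::min();

  void Insert(const int32_t* val) {
    // Assumption: NULL values are rare, so hint the early return as cold.
    if (SKETCH_UNLIKELY(val == nullptr)) return;
    // Assumption: after the first few rows, most values fall inside the
    // current range, so both updates are hinted as unlikely.
    if (SKETCH_UNLIKELY(*val < min)) min = *val;
    if (SKETCH_UNLIKELY(*val > max)) max = *val;
    assert(min <= max);  // invariant once at least one value is inserted
  }
};
```

The hints do not change the result of Insert(); they only tell the compiler which path to lay out as the fall-through case.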
In addition, the min/max stats for pages are read in batches, with a
fast-track version for the column types int32_t, int64_t, float,
double and date, whose storage format is identical to Parquet's. For a
row group, the page locations are read only once, instead of once for
every page skipped, resulting in a 100x speedup when a subset of 199
pages is skipped.
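The difference between reading page locations once per skipped page and once per row group can be sketched as below. All names are hypothetical, and DecodePageLocations() is only a stand-in for the real cost of reading a row group's page locations; the decode counter exists solely to make the difference observable.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one page's location.
struct SketchPageLocation {
  int64_t offset;
  int64_t first_row_index;
};

// Hypothetical row group; decode_count tracks how often locations are read.
struct SketchRowGroup {
  std::vector<SketchPageLocation> pages;
  mutable int decode_count = 0;

  // Stand-in for an expensive read of all page locations.
  std::vector<SketchPageLocation> DecodePageLocations() const {
    ++decode_count;
    return pages;
  }
};

// Before: the locations are read again for every skipped page.
std::vector<int64_t> SkippedOffsetsPerPage(
    const SketchRowGroup& rg, const std::vector<size_t>& skipped) {
  std::vector<int64_t> out;
  for (size_t idx : skipped) {
    const std::vector<SketchPageLocation> locs = rg.DecodePageLocations();
    assert(idx < locs.size());
    out.push_back(locs[idx].offset);
  }
  return out;
}

// After: the locations are read once and reused for all skipped pages.
std::vector<int64_t> SkippedOffsetsOnce(
    const SketchRowGroup& rg, const std::vector<size_t>& skipped) {
  const std::vector<SketchPageLocation> locs = rg.DecodePageLocations();
  std::vector<int64_t> out;
  out.reserve(skipped.size());
  for (size_t idx : skipped) {
    assert(idx < locs.size());
    out.push_back(locs[idx].offset);
  }
  return out;
}
```

Both functions return the same offsets; the second performs a single read regardless of how many pages are skipped.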
Testing:
1. Ran core test successfully;
2. Ran TPCDS performance tests.
Change-Id: I193646e7acfdd3023f7c947d8107da58a1f41183
Reviewed-on: http://gerrit.cloudera.org:8080/17295
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
> Bail out min/max filters in hash join builder early
> ----------------------------------------------------
>
> Key: IMPALA-10650
> URL: https://issues.apache.org/jira/browse/IMPALA-10650
> Project: IMPALA
> Issue Type: Improvement
> Reporter: Qifan Chen
> Assignee: Qifan Chen
> Priority: Major
>
> Currently, a mechanism is in place to set a min/max filter to always true
> (not useful) after all batches of rows are inserted into the hash table,
> utilizing the column stats. While quite helpful, the mechanism does not
> exploit the fact that the same not-useful state can be reached as soon as
> several batches are inserted.