[jira] [Commented] (IMPALA-10650) Bail out min/max filters in hash join builder early

ASF subversion and git services (Jira) Tue, 25 May 2021 23:41:05 -0700


    [ 
https://issues.apache.org/jira/browse/IMPALA-10650?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17351550#comment-17351550
 ]


ASF subversion and git services commented on IMPALA-10650:
----------------------------------------------------------

Commit b50d60a6c5b6fdd182dfc851841edae5cd1b3943 in impala's branch 
refs/heads/master from Qifan Chen
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=b50d60a ]

IMPALA-10650: Bailout min/max filters in hash join builder early

This change set addresses a weakness in populating min/max filters
in the hash join builder by periodically measuring the usefulness of
each filter and setting its 'always_true_' flag accordingly. Once the
flag is set to true, insertion into the filter skips every per-row
step, from evaluating the value of a row to checking that value
against the min/max range. This optimization is LLVM-enabled.
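
The early bail-out described above can be sketched roughly as follows.
This is a simplified illustration, not the Impala implementation: only
the 'always_true_' name comes from the commit message, while the class
name, the threshold, the check interval and the overlap-ratio test are
assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <limits>

// Hypothetical, simplified min/max filter over int64 values.
class MinMaxFilterSketch {
 public:
  // col_min/col_max: column stats of the target column (assumed known).
  // threshold: fraction of the column range above which the filter is
  //            treated as not useful (assumed policy).
  // check_interval: number of inserted rows between usefulness checks.
  MinMaxFilterSketch(int64_t col_min, int64_t col_max, double threshold,
                     int check_interval)
      : col_min_(col_min), col_max_(col_max), threshold_(threshold),
        check_interval_(check_interval) {
    assert(col_min <= col_max);
    assert(check_interval > 0);
  }

  void Insert(int64_t v) {
    // Once the filter is known to be useless, skip all per-row work.
    if (always_true_) return;
    min_ = std::min(min_, v);
    max_ = std::max(max_, v);
    if (++rows_since_check_ >= check_interval_) {
      rows_since_check_ = 0;
      MaybeBailOut();
    }
  }

  bool AlwaysTrue() const { return always_true_; }

 private:
  // Marks the filter as always true when [min_, max_] covers more than
  // 'threshold_' of the column range [col_min_, col_max_].
  void MaybeBailOut() {
    const double lo = static_cast<double>(std::max(min_, col_min_));
    const double hi = static_cast<double>(std::min(max_, col_max_));
    if (hi < lo) return;  // No overlap: the filter is still selective.
    const double col_range =
        static_cast<double>(col_max_) - static_cast<double>(col_min_);
    const double ratio = col_range > 0 ? (hi - lo) / col_range : 1.0;
    if (ratio > threshold_) always_true_ = true;
  }

  int64_t col_min_;
  int64_t col_max_;
  double threshold_;
  int check_interval_;
  int rows_since_check_ = 0;
  int64_t min_ = std::numeric_limits<int64_t>::max();
  int64_t max_ = std::numeric_limits<int64_t>::lowest();
  bool always_true_ = false;
};
```

With a column range of [0, 1000], a threshold of 0.5 and a check every
4 rows, inserting values that span most of the range flips the flag at
the next check, after which Insert() returns immediately.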

In addition, a new flag 'is_min_max_value_present' is added to
TRuntimeFilterTargetDesc to indicate whether the min/max column stats
are present in the query plan. The flag eliminates the need to check
for the presence of min/max stats for every row in the back end.
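
As a rough illustration of why a plan-level flag helps, the sketch
below tests the flag once per batch instead of once per row. Only the
field name 'is_min_max_value_present' is taken from the commit message;
the struct, its other fields and the function are hypothetical
stand-ins, not the Thrift definition or the back-end code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the planner-provided target descriptor.
struct TargetDescSketch {
  bool is_min_max_value_present = false;
  int64_t col_min = 0;  // assumed field, for illustration only
  int64_t col_max = 0;  // assumed field, for illustration only
};

// Counts the rows that fall inside the column range. The presence of
// the stats is decided once from the descriptor, not for every row.
int64_t CountRowsInColumnRange(const TargetDescSketch& desc,
                               const std::vector<int64_t>& rows) {
  if (!desc.is_min_max_value_present) return 0;
  assert(desc.col_min <= desc.col_max);
  int64_t in_range = 0;
  for (int64_t v : rows) {
    if (v >= desc.col_min && v <= desc.col_max) ++in_range;
  }
  return in_range;
}
```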

Early bail-out improves the HJ builder step in general. For example,
the step for join node #11 in TPCDS Q8 improves by 13%, and the step
for join node #8 in TPCDS Q16 improves by 3.2%.

The Insert() methods are optimized with branch prediction compiler
hints, which yield the following improvements when tested with the
insertion of 10000 randomly generated items:

  Small Integers: 7.0%
  Integers:       4.1%
  Big Integers:   4.3%
  Strings:        5.6%
  Dates:          4.4%
  Timestamps:    10.7%
  Decimals(4):   10.4%
  Decimals(8):    9.1%
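
A minimal sketch of what branch prediction hints in an Insert() method
can look like is given below. It relies on the GCC/Clang builtin
__builtin_expect; the macro names, the struct and the choice of which
branches are marked unlikely are assumptions for illustration and are
not taken from the commit.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

#if defined(__GNUC__) || defined(__clang__)
#define SKETCH_UNLIKELY(x) (__builtin_expect(!!(x), 0))
#else
#define SKETCH_UNLIKELY(x) (x)  // No hint on other compilers.
#endif

// Hypothetical min/max tracker for 32-bit integers.
struct IntMinMaxSketch {
  int32_t min_val = std::numeric_limits<int32_t>::max();
  int32_t max_val = std::numeric_limits<int32_t>::lowest();

  void Insert(const int32_t* v) {
    // NULL values are assumed to be rare and are ignored.
    if (SKETCH_UNLIKELY(v == nullptr)) return;
    // After the first few rows, most values fall inside the current
    // range, so both updates are hinted as the uncommon path.
    if (SKETCH_UNLIKELY(*v < min_val)) min_val = *v;
    if (SKETCH_UNLIKELY(*v > max_val)) max_val = *v;
  }
};

// Inserts 'n' values from 'vals' into the tracker.
void InsertAll(IntMinMaxSketch& f, const int32_t* vals, std::size_t n) {
  assert(vals != nullptr || n == 0);
  for (std::size_t i = 0; i < n; ++i) f.Insert(&vals[i]);
}
```

The hints do not change the result, only the code layout the compiler
chooses, so the gain depends on how well the hint matches the data.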

In addition, the min/max stats for pages are read in batches, with a
fast-track version for the column types int32_t, int64_t, float,
double and date, which have the same storage format as in Parquet. For
a row group, the page locations are read only once instead of once for
every page skipped, resulting in a 100x speedup when a subset of 199
pages is skipped.
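
The fast-track idea for batch decoding can be sketched as below: when
the encoded bytes of a statistic already match the in-memory layout of
the type, each value can be copied directly instead of going through a
per-type conversion. The function name and signature are hypothetical,
and the sketch assumes a little-endian host and one plain-encoded value
per page.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <type_traits>
#include <vector>

// Decodes one encoded statistic per page into 'out'. Returns false at
// the first value whose size does not match T, so that a caller could
// fall back to a slower path; 'out' then holds the values decoded so
// far.
template <typename T>
bool DecodeStatsBatch(const std::vector<std::string>& encoded,
                      std::vector<T>* out) {
  static_assert(std::is_trivially_copyable<T>::value,
                "fast track requires a trivially copyable type");
  assert(out != nullptr);
  out->clear();
  out->reserve(encoded.size());
  for (const std::string& bytes : encoded) {
    if (bytes.size() != sizeof(T)) return false;
    T v;
    std::memcpy(&v, bytes.data(), sizeof(T));  // Bit-for-bit copy.
    out->push_back(v);
  }
  return true;
}
```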

Testing:
  1. Ran core test successfully;
  2. Ran TPCDS performance tests.

Change-Id: I193646e7acfdd3023f7c947d8107da58a1f41183
Reviewed-on: http://gerrit.cloudera.org:8080/17295
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Bail out min/max filters in hash join builder early 
> ----------------------------------------------------
>
>                 Key: IMPALA-10650
>                 URL: https://issues.apache.org/jira/browse/IMPALA-10650
>             Project: IMPALA
>          Issue Type: Improvement
>            Reporter: Qifan Chen
>            Assignee: Qifan Chen
>            Priority: Major
>
> Currently, a mechanism is in place to set a min/max filter to always true 
> (not useful) after all batches of rows are inserted into the hash table, 
> utilizing the column stats.  While quite helpful, the mechanism does not 
> exploit the property that the same not useful state can be reached as soon as 
> several batches are inserted. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
