Hello Tim Armstrong,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/2783
to look at the new patch set (#6).
Change subject: IMPALA-3077: Enable runtime filters when PHJ spills
......................................................................
IMPALA-3077: Enable runtime filters when PHJ spills
This patch changes when runtime filters are produced in the partitioned
hash-join node to allow filters to be produced even when the PHJ
spills. Filters are now produced during the level0 processing of the
PHJ's build-side input in ProcessBuildBatch().
Since this function is codegen'ed, so now is filter production. We use
constant-propagation via constant argument injection to disable filter
production at no cost when it is not needed (including in level1+
repartitioning). I inspected the IR to confirm that the constant
propagation works as expected.
This change also allows us to send filters earlier during build-side
processing. A tradeoff is that filters are still built even if the
expected FP rate is too high, although any too-permissive filters are
still not sent to the scan (see 'Performance impact' below).
The restriction that prevented filters from being computed inside a
sub-plan is removed as part of this cleanup (since the FE handles
assigning filters correctly in subplans), and a test is added to confirm
that one of the correct cases for filters in subplans works.
This patch also fixes a bug where re-partitioning beyond level0 would
not use the codegen'ed implementation of ProcessBuildBatch().
A new test is added to test_runtime_row_filters, for Parquet only, which
spills and confirms that filtering still occurs.
Finally, the legacy --enable_phj_probe_side_filtering /
--enable_probe_side_filtering flags have been deprecated, as runtime
filtering can be permanently disabled via setting
RUNTIME_FILTER_MODE=OFF. The implementation that the old flags referred
to has been removed.
Performance impact
------------------
We benchmark the performance loss due to always computing runtime
filters even when the FP-rate will turn out to be too high as follows:
select STRAIGHT_JOIN count(*) from (select id from functional.alltypes
LIMIT 1) a JOIN [BROADCAST] (select * FROM p LIMIT 100000000) b on a.id
= -b.id and b.part_col > 0
('p' is a two-column Parquet table with 1B rows).
This builds a 100M row build table (benchmarks run on one node). When
filtering is enabled, the filter is built but selects all rows from the
probe side (so that there's no benefit to having the filter, to
emphasise the cost of building the filter in the first place).
RUNTIME_FILTER_MODE Avg. time (s) over 5 runs
OFF 18.95
GLOBAL 19.55
-------------------------------
Change +3%
Change-Id: I59a2d9ee03ccea6b674392584e4c7f272233571e
---
M be/src/exec/blocking-join-node.cc
M be/src/exec/blocking-join-node.h
M be/src/exec/hash-join-node.cc
M be/src/exec/partitioned-hash-join-node-ir.cc
M be/src/exec/partitioned-hash-join-node.cc
M be/src/exec/partitioned-hash-join-node.h
M testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters.test
7 files changed, 130 insertions(+), 122 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala refs/changes/83/2783/6
--
To view, visit http://gerrit.cloudera.org:8080/2783
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I59a2d9ee03ccea6b674392584e4c7f272233571e
Gerrit-PatchSet: 6
Gerrit-Project: Impala
Gerrit-Branch: cdh5-trunk
Gerrit-Owner: Henry Robinson <[email protected]>
Gerrit-Reviewer: Dan Hecht <[email protected]>
Gerrit-Reviewer: Henry Robinson <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>