Henry Robinson has posted comments on this change. Change subject: IMPALA-3007: Adjust Bloom Filter size according to NDV estimate ......................................................................
Patch Set 4: (6 comments) http://gerrit.cloudera.org:8080/#/c/2812/4/be/src/exec/old-hash-table.cc File be/src/exec/old-hash-table.cc: Line 143: int64_t filter_size = state_->filter_bank()->GetFilterSizeForNdv( > i think it's best to compute the size during initialization of the filter c Done http://gerrit.cloudera.org:8080/#/c/2812/4/be/src/exec/partitioned-hash-join-node.cc File be/src/exec/partitioned-hash-join-node.cc: Line 522: ctx.filter->filter_desc().ndv_estimate); > why not record the actual size in the filter context. that way, you can't h It's a bit more convenient to put it in the RuntimeFilter itself (that way the RuntimeFilterBank can set it without the caller having to do so). Line 524: bool fp_rate_too_high = > don't we check this periodically, or is that only done in the parquet scann Only in the parquet scanner. Not clear if there'd be an advantage to doing it periodically here as doing so would add a branch back to the filter building path. http://gerrit.cloudera.org:8080/#/c/2812/4/fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java File fe/src/main/java/com/cloudera/impala/planner/DistributedPlanner.java: Line 518: // estimates may have changed as well. > looking at this again, i'm not sure i follow that argument. in what way hav After further digging in, I think this is a bug. With a join-node that has a union as build input and a limit on the union, the cardinality estimate after single-node planning (when the filters are originally generated) is incorrect. It does not apply the limit. After fragmentation (and the insertion of an xchg into the build input) the cardinality corrects itself. See IMPALA-3450 for more details. http://gerrit.cloudera.org:8080/#/c/2812/4/testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters_phj.test File testdata/workloads/functional-query/queries/QueryTest/runtime_row_filters_phj.test: Line 6: # consumption / spilling behaviour. > move that last sentence to the very top, as a separate comment, because it Not sure exactly where you mean: this is the only query this applies to. (The nested-types query below is here because the old HJ implementation doesn't work with nested-types). Line 26: > reign in the blank lines Done -- To view, visit http://gerrit.cloudera.org:8080/2812 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I1fe37b8d4cfb3c52bb8e8cf0ca55e92665b87803 Gerrit-PatchSet: 4 Gerrit-Project: Impala Gerrit-Branch: cdh5-trunk Gerrit-Owner: Henry Robinson <[email protected]> Gerrit-Reviewer: Henry Robinson <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-HasComments: Yes
