Zach Amsden has uploaded a new patch set (#7).

Change subject: IMPALA-4864 Speed up single slot predicates with dictionaries
......................................................................

IMPALA-4864 Speed up single slot predicates with dictionaries

When dictionaries are present, we can pre-evaluate conjuncts against the
dictionary values and simply look up the result.

Status of this diff: Compiles and starts. Bitmap tests for new
functionality pass. Tests are broken due to an inadvertent change in the
parquet column reader that is causing file decode to break. Needs
debugging, and there are definitely some bugs, but the exposition of the
concept is now fully formed.

Basic idea: we codegen early, before we know enough about the columns to
tell whether they are dict filterable. So, if we do have dictionary
filtering predicates, we codegen a guard around each dictionary
filterable predicate evaluation. This guard skips evaluation of the
predicate if it has already been evaluated against the dictionary. In
this way, we can skip evaluation dynamically for each row group as we
learn which columns are dictionary filterable, and then push predicate
evaluation into the column reader. Since the branches remain 100%
predictable over the row group, this should give us the fastest way to
skip predicate evaluation without compromising the general case where we
may be unable to evaluate against the dictionary. We can even do this
with codegen turned off, as a side effect of the way we generate the
codegen'd function when dictionary evaluation is enabled.

If dictionaries aren't available for some predicates, we automatically
fall back to evaluating those predicates in the original order,
preserving selectivity. The overhead in this case is a perfectly
predictable extra conditional per dictionary predicate. We could codegen
another version of the EvalConjuncts function without this overhead, but
we do not, because of the complexity and pain involved (ScanNodeBase
assumes one codegen'd function per file format, so we would have to
simulate a file format or use some other awful hack).

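The guard described above can be illustrated with a minimal standalone sketch. This is not the code from this patch: DictColumn, PreEvalDict and ScanRowGroup are hypothetical names, the predicate is a plain std::function rather than a codegen'd conjunct, and a std::vector<bool> stands in for the bitmap. It only shows the shape of the idea: evaluate the predicate once per dictionary entry, look the result up per row, and guard the original per-row evaluation with a flag that is constant over the row group.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical stand-in for one dictionary-encoded column in a row group.
struct DictColumn {
  std::vector<int32_t> dict;    // distinct values in the dictionary
  std::vector<uint32_t> codes;  // per-row index into dict
};

using Predicate = std::function<bool(int32_t)>;

// Evaluate the predicate once per dictionary entry and remember the result.
std::vector<bool> PreEvalDict(const DictColumn& col, const Predicate& pred) {
  std::vector<bool> passes(col.dict.size());
  for (size_t i = 0; i < col.dict.size(); ++i) passes[i] = pred(col.dict[i]);
  return passes;
}

// Returns the indices of the rows that satisfy the predicate.
// dict_filterable is decided once per row group, so both branches inside the
// loop take the same direction for every row and are perfectly predictable.
std::vector<size_t> ScanRowGroup(const DictColumn& col, const Predicate& pred,
                                 bool dict_filterable) {
  std::vector<bool> passes;
  if (dict_filterable) passes = PreEvalDict(col, pred);

  std::vector<size_t> selected;
  for (size_t row = 0; row < col.codes.size(); ++row) {
    const uint32_t code = col.codes[row];
    // "Column reader" side: a lookup replaces the predicate evaluation.
    if (dict_filterable && !passes[code]) continue;
    // Guard: evaluate the predicate per row only if the dictionary did not
    // already answer it (the fallback path).
    if (!dict_filterable && !pred(col.dict[code])) continue;
    selected.push_back(row);
  }
  return selected;
}
```

With a three-entry dictionary and six rows, the dictionary path calls the predicate three times (once per entry) while the fallback path calls it six times (once per row); both select the same rows.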
Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054
---
M be/src/codegen/gen_ir_descriptions.py
M be/src/common/global-flags.cc
M be/src/exec/exec-node.cc
M be/src/exec/exec-node.h
M be/src/exec/hash-join-node.cc
M be/src/exec/hdfs-avro-scanner.cc
M be/src/exec/hdfs-parquet-scanner-ir.cc
M be/src/exec/hdfs-parquet-scanner.cc
M be/src/exec/hdfs-parquet-scanner.h
M be/src/exec/hdfs-scan-node-base.cc
M be/src/exec/hdfs-scan-node-base.h
M be/src/exec/parquet-column-readers.cc
M be/src/exec/parquet-column-readers.h
M be/src/exec/parquet-scratch-tuple-batch.h
M be/src/exec/partitioned-hash-join-node.cc
M be/src/util/bitmap-test.cc
M be/src/util/bitmap.h
M be/src/util/dict-encoding.h
M testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
19 files changed, 561 insertions(+), 188 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/6726/7
--
To view, visit http://gerrit.cloudera.org:8080/6726
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I65981c89e5292086809ec1268f5a273f4c1fe054
Gerrit-PatchSet: 7
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Zach Amsden <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
