Joe McDonnell has uploaded a new patch set (#14). Change subject: IMPALA-4624: Implement Parquet dictionary filtering ......................................................................
IMPALA-4624: Implement Parquet dictionary filtering Here is a basic summary of the changes: Frontend looks for conjuncts that operate on a single slot and pass a map from slot id to the conjunct index through thrift to the backend. The conjunct indices are the incides into the normal PlanNode conjuncts list. The conjuncts need to satisfy certain conditions: 1. They are bound on a single slot 2. They are deterministic (no random functions) 3. They evaluate to FALSE on a NULL input. This is because the dictionary does not include NULLs, so any condition that evaluates to TRUE on NULL cannot be evaluated by looking only at the dictionary. The backend converts the indices into ExprContexts. These are cloned in the scanner threads. The dictionary read codepath has been removed from ReadDataPage into its own function, InitDictionary. This has also been turned into its own step in row group initialization. ReadDataPage will not see any dictionary pages unless the parquet file is invalid. For dictionary filtering, we initialize dictionaries only as needed to evaluate the conjuncts. The Parquet scanner evaluates the dictionary filter conjuncts on the dictionary to see if any dictionary entry passes. If no entry passes, the row group is eliminated. If the row group passes the dictionary filtering, then we initialize all remaining dictionaries. Since column chunks can have a mixture of encodings, dictionary filtering uses three tests to determine whether this is purely dictionary encoded: 1. If the encoding_stats is in the parquet file, then use it to determine if there are only dictionary encoded pages (i.e. there are no data pages with an encoding other than PLAIN_DICTIONARY). -OR- 2. If the encoding stats are not present, then look at the encodings. The column is purely dictionary encoded if: a) PLAIN_DICTIONARY is present AND b) Only PLAIN_DICTIONARY, RLE, or BIT_PACKED encodings are listed -OR- 3. If this file was written by an older version of Impala, then we know that dictionary failover happens when the dictionary reaches 40,000 values. Dictionary filtering can proceed as long as the dictionary is smaller than that. parquet-mr writes the encoding list correctly in the current version in our environment (1.5.0). This means that check #2 works on some existing files (potentially most existing parquet-mr files). parquet-mr writes the encoding stats starting in 1.9.0. This is the version where check #1 will start working. Impala's parquet writer now implements both, so either check above will work. Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 --- M be/src/exec/hdfs-parquet-scanner.cc M be/src/exec/hdfs-parquet-scanner.h M be/src/exec/hdfs-parquet-table-writer.cc M be/src/exec/hdfs-scan-node-base.cc M be/src/exec/hdfs-scan-node-base.h M be/src/exec/hdfs-scanner.cc M be/src/exec/hdfs-scanner.h M be/src/exec/parquet-column-readers.cc M be/src/exec/parquet-column-readers.h M be/src/service/query-options.cc M be/src/service/query-options.h M be/src/util/dict-encoding.h M be/src/util/dict-test.cc M common/thrift/ImpalaInternalService.thrift M common/thrift/ImpalaService.thrift M common/thrift/PlanNodes.thrift M common/thrift/parquet.thrift M fe/src/main/java/org/apache/impala/analysis/Expr.java M fe/src/main/java/org/apache/impala/analysis/FunctionCallExpr.java M fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java M fe/src/test/java/org/apache/impala/planner/PlannerTest.java M testdata/workloads/functional-planner/queries/PlannerTest/constant-folding.test M testdata/workloads/functional-planner/queries/PlannerTest/mt-dop-validation.test A testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test A testdata/workloads/functional-query/queries/QueryTest/mt-dop-parquet-filtering.test A testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test M tests/query_test/test_scanners.py 27 files changed, 1,417 insertions(+), 187 deletions(-) git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/04/5904/14 -- To view, visit http://gerrit.cloudera.org:8080/5904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: newpatchset Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 Gerrit-PatchSet: 14 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnell <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Matthew Mulder <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
