Alex Behm has posted comments on this change. ( http://gerrit.cloudera.org:8080/8775 )
Change subject: IMPALA-4993: extend dictionary filtering to collections ...................................................................... Patch Set 8: (22 comments) http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.h@420 PS8, Line 420: /// dictionary filtering. The definition/repetition levels are are populated by the are are http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@778 PS8, Line 778: } Did you intend to add the collection_reader to non_filtering? I think deeply nested CollectionColumnReaders might never be added to non_dict_filterable_readers_ Maybe that's ok. Might be easier to completely ignore collection column readers since those are unconditionally never filterable. Might also be harder, not sure, your call. http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@809 PS8, Line 809: if (!state_->query_options().parquet_dictionary_filtering || dict_filter_map_.empty()) { Looks like non_dict_filterable_readers_ is inconsistently populated. With dictionary filtering enabled it would also include all nested column readers that are not filterable, but with dictionary filtering disabled it only includes the top-level column readers. I assume that's not really a problem, but still seems inconsistent. http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@845 PS8, Line 845: if (tuple_desc == nullptr) tuple_desc = scan_node_->tuple_desc(); Why this check? I believe all SlotDescriptors require a parent. We even DHCECK that in the SlotDescriptor c'tor. If so, we can just remove this function http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-scanner.h@222 PS8, Line 222: typedef std::map<std::pair<TupleId, SlotId>, std::vector<ScalarExprEvaluator*>> Why is the key a pair and not just SlotId or a SlotDescriptor*? http://gerrit.cloudera.org:8080/#/c/8775/8/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8775/8/common/thrift/PlanNodes.thrift@245 PS8, Line 245: 9: optional list<TDictFilter> dictionary_filter_conjuncts Not sure I understand why this needed to change. SlotIds are unique, so what benefit do we get from adding the tuple id? The TupleId can be derived from the SlotId through the DescriptorTbl. http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@115 PS8, Line 115: * each value present in the row group's dictionary ... prune a row group by evaluating the conjuncts against the respective column dictionaries. (there's no such thing as "the row group's dictionary" right?) http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@200 PS8, Line 200: private Map<Pair<TupleDescriptor, SlotId>, List<Integer>> dictionaryFilterConjuncts_ = Unfortunate that we have to use a Pair as key since conceptually the pair has redundant information (slot and tuple are 1:1) Maybe a SlotDescriptor would work as key instead? http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@543 PS8, Line 543: * These filters are added by analysis (see: SelectStmt#registerIsNotEmptyPredicates). Move this comment to notEmptyCollections_ http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@548 PS8, Line 548: private boolean isCollectionNotEmpty(TupleDescriptor desc) { Why do we need this function? Can't we just check notEmptyCollections_.contains(desc) inline? http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@638 PS8, Line 638: private void addDictionaryFilter(Analyzer analyzer, Expr conjunct, int conjunct_idx) { conjunctIdx http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@641 PS8, Line 641: remove newline (these 3 lines kind of belong together) http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@675 PS8, Line 675: Pair<TupleDescriptor, SlotId> slotKey = If you agree to simplify the map key to SlotDescriptor, then I think we can remove tupleIds from this function entirely. The Preconditions check in L647 is not that useful because that's a fundamental guarantee of our design. http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@677 PS8, Line 677: if (dictionaryFilterConjuncts_.containsKey(slotKey)) { We typically use this more idiomatic pattern to avoid duplicate lookups (same LOC): List<Integer> slotList = dictionaryFilterConjuncts_.get(slotKey); if (slotList == null) { slotList = Lists.newArrayList(); dictionaryFilterConjuncts_.put(slotKey, slotList); } slotList.add(conjunctIdx); http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@690 PS8, Line 690: for (int conjunct_idx = 0; conjunct_idx < conjuncts_.size(); ++conjunct_idx) { conjunctIdx here and below http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1039 PS8, Line 1039: f.setTuple_id(entry.getKey().first.getId().asInt()); The tuple id redundant. Can we avoid sending it? http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1102 PS8, Line 1102: // Groups the dictionary filterable conjuncts by tuple descriptor. Let's add this into a helper function. http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1116 PS8, Line 1116: List<Integer> totalIdxList = Lists.newArrayList(entry.getValue()); why a new arraylist? http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1121 PS8, Line 1121: List<Expr> exprList = Lists.newArrayList(); move closer to where it's used (before L1132) http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test: http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@330 PS8, Line 330: parquet statistics predicates: c_custkey > 0, o.o_orderkey > 0, l.l_partkey > 0 inconsistent explain-plan printing: the stats predicates are all clumped together, but the dictionary predicates are grouped by tuple Ok to defer the fix if you prefer, but I think we should address the inconsistency. http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@385 PS8, Line 385: l.l_receiptdate = '1994-08-24' and l.l_shipmode = 'RAIL' and l.l_returnflag = 'R' also add a predicate that is true when the slot is NULL http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test: http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@272 PS8, Line 272: select id from functional_parquet.complextypestbl f, f.int_map m where m.key = 'k5' If not already tested elsewhere, it might be worth adding an equivalent test like this just to make sure it works: select id from functional_parquet.complextypestbl.int_map m where m.key = 'k5' -- To view, visit http://gerrit.cloudera.org:8080/8775 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd Gerrit-Change-Number: 8775 Gerrit-PatchSet: 8 Gerrit-Owner: Vuk Ercegovac <vercego...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com> Gerrit-Comment-Date: Thu, 11 Jan 2018 05:47:27 +0000 Gerrit-HasComments: Yes