Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8775 )
Change subject: IMPALA-4993: extend dictionary filtering to collections ...................................................................... Patch Set 8: (23 comments) http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.h@420 PS8, Line 420: /// dictionary filtering. The definition/repetition levels are are populated by the > are are Done http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@778 PS8, Line 778: } > FWIW, I discovered today I needed to make some similar adjustments to the c will take a look at your change more closely. http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@778 PS8, Line 778: } > Did you intend to add the collection_reader to non_filtering? no, I intended to omit the collection readers. this partitions all nested scalar readers. http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@809 PS8, Line 809: if (!state_->query_options().parquet_dictionary_filtering || dict_filter_map_.empty()) { > Looks like non_dict_filterable_readers_ is inconsistently populated. With d yup... I assumed that if its a mix of collection + scalars is ok since that's what works today (perhaps modified by tim's change). And I don't see a problem with flattening down to just the scalar readers. http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@845 PS8, Line 845: if (tuple_desc == nullptr) tuple_desc = scan_node_->tuple_desc(); > Why this check? I believe all SlotDescriptors require a parent. We even DHC that's correct, removed. I think I carried that conservative check over from the frontend where the parent is not required to be non-null in the ctor. should it be required? http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-scanner.h File be/src/exec/hdfs-scanner.h: http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-scanner.h@222 PS8, Line 222: typedef std::map<std::pair<TupleId, SlotId>, std::vector<ScalarExprEvaluator*>> > Why is the key a pair and not just SlotId or a SlotDescriptor*? I'll comment here on the <tupleid, slotid> vs slotid decision here (covers several related comments in the thrift file and hdfsscannode). I suspected that slotid is unique across tuples so with your other comments and various other implications baked into various apis, I've gone ahead and made this change. However, with a two-level naming scheme (tuple/slot), I was a bit unclear on this and did not find it stated explicitly. I would've expected this in frontend's TupleDescriptor-- should I add a clarifying blurb there? http://gerrit.cloudera.org:8080/#/c/8775/8/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8775/8/common/thrift/PlanNodes.thrift@245 PS8, Line 245: 9: optional list<TDictFilter> dictionary_filter_conjuncts > Not sure I understand why this needed to change. SlotIds are unique, so wha Done. See comment in hdfs-scanner for why I went down this route. http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@115 PS8, Line 115: * each value present in the row group's dictionary > ... prune a row group by evaluating the conjuncts against the respective co Done http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@200 PS8, Line 200: private Map<Pair<TupleDescriptor, SlotId>, List<Integer>> dictionaryFilterConjuncts_ = > Unfortunate that we have to use a Pair as key since conceptually the pair h Done. See comment in hdfs-scanner for why I went down this route. http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@543 PS8, Line 543: * These filters are added by analysis (see: SelectStmt#registerIsNotEmptyPredicates). > Move this comment to notEmptyCollections_ Done http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@548 PS8, Line 548: private boolean isCollectionNotEmpty(TupleDescriptor desc) { > Why do we need this function? Can't we just check notEmptyCollections_.cont its used in a couple of places and I wanted to keep the same checks in both (and not repeat the checks). http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@638 PS8, Line 638: private void addDictionaryFilter(Analyzer analyzer, Expr conjunct, int conjunct_idx) { > conjunctIdx Done http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@641 PS8, Line 641: > remove newline (these 3 lines kind of belong together) Done http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@675 PS8, Line 675: Pair<TupleDescriptor, SlotId> slotKey = > If you agree to simplify the map key to SlotDescriptor, then I think we can Done http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@677 PS8, Line 677: if (dictionaryFilterConjuncts_.containsKey(slotKey)) { > We typically use this more idiomatic pattern to avoid duplicate lookups (sa Done http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@690 PS8, Line 690: for (int conjunct_idx = 0; conjunct_idx < conjuncts_.size(); ++conjunct_idx) { > conjunctIdx here and below Done http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1039 PS8, Line 1039: f.setTuple_id(entry.getKey().first.getId().asInt()); > The tuple id redundant. Can we avoid sending it? Done. See comment in hdfs-scanner for why I went down this route. http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1102 PS8, Line 1102: // Groups the dictionary filterable conjuncts by tuple descriptor. > Let's add this into a helper function. Done http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1116 PS8, Line 1116: List<Integer> totalIdxList = Lists.newArrayList(entry.getValue()); > why a new arraylist? didn't want to mutate a class member for the purpose of printing diags. http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1121 PS8, Line 1121: List<Expr> exprList = Lists.newArrayList(); > move closer to where it's used (before L1132) Done http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test: http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@330 PS8, Line 330: parquet statistics predicates: c_custkey > 0, o.o_orderkey > 0, l.l_partkey > 0 > inconsistent explain-plan printing: the stats predicates are all clumped to hmm... actually I followed the pattern on L324-5 which separates the predicates by tuple. http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@385 PS8, Line 385: l.l_receiptdate = '1994-08-24' and l.l_shipmode = 'RAIL' and l.l_returnflag = 'R' > also add a predicate that is true when the slot is NULL Done http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test: http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@272 PS8, Line 272: select id from functional_parquet.complextypestbl f, f.int_map m where m.key = 'k5' > If not already tested elsewhere, it might be worth adding an equivalent tes Done -- To view, visit http://gerrit.cloudera.org:8080/8775 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd Gerrit-Change-Number: 8775 Gerrit-PatchSet: 8 Gerrit-Owner: Vuk Ercegovac <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Vuk Ercegovac <[email protected]> Gerrit-Comment-Date: Thu, 11 Jan 2018 20:08:32 +0000 Gerrit-HasComments: Yes
