Vuk Ercegovac has posted comments on this change. ( http://gerrit.cloudera.org:8080/8775 )
Change subject: IMPALA-4993: extend dictionary filtering to collections ...................................................................... Patch Set 1: (19 comments) thx for the reviews. http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-parquet-scanner.h@452 PS1, Line 452: Only scalar columns can be : /// dictionary filtered > Can you mention in the comment why? Done. re-worded http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scan-node-base.h@167 PS1, Line 167: return thrift_dict_filter_conjuncts_; > You could make the calling code cleaner by returning an empty vector here. its used in one place so don't think its worth it. http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scan-node-base.h@382 PS1, Line 382: /// Dictionary filtering eligible conjuncts for each slot. > Can you add to the comment that this can be nullptr? Done http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scanner.cc File be/src/exec/hdfs-scanner.cc: http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scanner.cc@90 PS1, Line 90: auto > We usually only use auto for iterators and const& in for loops. I think thi Done http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scanner.cc@97 PS1, Line 97: entry.tuple_id > you could just use 'tuple_id' here (from L86). Done http://gerrit.cloudera.org:8080/#/c/8775/1/common/thrift/PlanNodes.thrift File common/thrift/PlanNodes.thrift: http://gerrit.cloudera.org:8080/#/c/8775/1/common/thrift/PlanNodes.thrift@202 PS1, Line 202: between a TupleId, SlotId pair to a > nit: I think it should be "mapping between ... and ..." or "mapping from .. Done http://gerrit.cloudera.org:8080/#/c/8775/1/common/thrift/PlanNodes.thrift@209 PS1, Line 209: optional > Shouldn't they all be required? Done http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@187 PS1, Line 187: // Map from pairs of TupleDescriptor, SlotIds to indices in PlanNodes.conjuncts_ and > I think this comment could benefit from a description of what makes a slot/ min-max pruning is discussed in the header, but dictionary pruning is not. I added a few sentences there so that its less surprising when seeing these supporting data-structures for the first time. http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@188 PS1, Line 188: collectionConjuncts > nit: missing _ Done http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@188 PS1, Line 188: // collectionConjuncts that are eligible for dictionary filtering. The TupleDescriptor > nit: I'd phrase it as "slots in the TupleDescriptor of this scan node map t Done http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@624 PS1, Line 624: Preconditions.checkState(tupleIds.size() == 1); > It doesn't seem obvious to me why this is true. Can you add a comment? A bit unclear to me as well. However, if I understand this literally, we only support single slot conjuncts for dictionary pruning, so therefore should expect that that single slot only refers to a single tuple. I reduced the multiple checks against slot id and added a comment. http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@656 PS1, Line 656: * Walks through conjuncts and collectionConjuncts and populates > nit: trailing _s Done http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@668 PS1, Line 668: notEmptyCollections_.contains(entry.getKey()) > This could be factored into its own method together with the comment above. Done http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1010 PS1, Line 1010: for (Map.Entry<Pair<TupleDescriptor, SlotId>, List<Integer>> entry : dictionaryFilterConjuncts_.entrySet()) { > long line Done http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1075 PS1, Line 1075: if (!dictionaryFilterConjuncts_.isEmpty()) { > This check looks like it's not needed. whoops... missed the clear simplification. Done. http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test: http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@a2 PS1, Line 2: > Why is this not needed anymore? I saw in history that it was the case at one point, but the mirroring has since been removed, so this comment is stale. http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@265 PS1, Line 265: # - number of projections and their nesting depth > I suspect this comment may rot if we keep it here, but to me it looks like I think there's some benefit to explicitly listing the areas tested/to-test so that we can see expectations and coverage in-line. If expansion is required here due to extra cases or bugs, I'd find it harder to figure out the scope from multiple commit messages. http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@275 PS1, Line 275: 1 > Why is only one row group filtered out here and not 2? there are two files in the table. one of them is dictionary encoded for int_map key (so its filtered) and the other is plain encoded (so is not filtered). when I look at the plain encoded one, I see only one value, so perhaps the idea is not to use dictionary encoding when its not worth it? http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@374 PS1, Line 374: # Array struct value at depth 3. Bottom not required. No matches. Multiple nested values projected. > nit: long lines Done -- To view, visit http://gerrit.cloudera.org:8080/8775 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd Gerrit-Change-Number: 8775 Gerrit-PatchSet: 1 Gerrit-Owner: Vuk Ercegovac <vercego...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com> Gerrit-Comment-Date: Mon, 11 Dec 2017 20:47:53 +0000 Gerrit-HasComments: Yes