[Impala-ASF-CR] IMPALA-4993: extend dictionary filtering to collections

Vuk Ercegovac (Code Review) Mon, 11 Dec 2017 12:57:41 -0800

Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8775 )


Change subject: IMPALA-4993: extend dictionary filtering to collections
......................................................................


Patch Set 1:

(19 comments)

thx for the reviews.

http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-parquet-scanner.h@452
PS1, Line 452: Only scalar columns can be
             :   /// dictionary filtered
> Can you mention in the comment why?
Done. re-worded


http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scan-node-base.h
File be/src/exec/hdfs-scan-node-base.h:

http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scan-node-base.h@167
PS1, Line 167:     return thrift_dict_filter_conjuncts_;
> You could make the calling code cleaner by returning an empty vector here.
its used in one place so don't think its worth it.


http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scan-node-base.h@382
PS1, Line 382:   /// Dictionary filtering eligible conjuncts for each slot.
> Can you add to the comment that this can be nullptr?
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scanner.cc
File be/src/exec/hdfs-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scanner.cc@90
PS1, Line 90: auto
> We usually only use auto for iterators and const& in for loops. I think thi
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/be/src/exec/hdfs-scanner.cc@97
PS1, Line 97: entry.tuple_id
> you could just use 'tuple_id' here (from L86).
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8775/1/common/thrift/PlanNodes.thrift@202
PS1, Line 202: between a TupleId, SlotId pair to a
> nit: I think it should be "mapping between ... and ..." or "mapping from ..
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/common/thrift/PlanNodes.thrift@209
PS1, Line 209: optional
> Shouldn't they all be required?
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@187
PS1, Line 187:   // Map from pairs of TupleDescriptor, SlotIds to indices in 
PlanNodes.conjuncts_ and
> I think this comment could benefit from a description of what makes a slot/
min-max pruning is discussed in the header, but dictionary pruning is not. I 
added a few sentences there so that its less surprising when seeing these 
supporting data-structures for the first time.


http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@188
PS1, Line 188: collectionConjuncts
> nit: missing _
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@188
PS1, Line 188:   // collectionConjuncts that are eligible for dictionary 
filtering. The TupleDescriptor
> nit: I'd phrase it as "slots in the TupleDescriptor of this scan node map t
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@624
PS1, Line 624:     Preconditions.checkState(tupleIds.size() == 1);
> It doesn't seem obvious to me why this is true. Can you add a comment?
A bit unclear to me as well. However, if I understand this literally, we only 
support single slot conjuncts for dictionary pruning, so therefore should 
expect that that single slot only refers to a single tuple. I reduced the 
multiple checks against slot id and added a comment.


http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@656
PS1, Line 656:    * Walks through conjuncts and collectionConjuncts and 
populates
> nit: trailing _s
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@668
PS1, Line 668: notEmptyCollections_.contains(entry.getKey())
> This could be factored into its own method together with the comment above.
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1010
PS1, Line 1010:     for (Map.Entry<Pair<TupleDescriptor, SlotId>, 
List<Integer>> entry : dictionaryFilterConjuncts_.entrySet()) {
> long line
Done


http://gerrit.cloudera.org:8080/#/c/8775/1/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1075
PS1, Line 1075:       if (!dictionaryFilterConjuncts_.isEmpty()) {
> This check looks like it's not needed.
whoops... missed the clear simplification. Done.


http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@a2
PS1, Line 2:
> Why is this not needed anymore?
I saw in history that it was the case at one point, but the mirroring has since 
been removed, so this comment is stale.


http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@265
PS1, Line 265: # - number of projections and their nesting depth
> I suspect this comment may rot if we keep it here, but to me it looks like
I think there's some benefit to explicitly listing the areas tested/to-test so 
that we can see expectations and coverage in-line. If expansion is required 
here due to extra cases or bugs, I'd find it harder to figure out the scope 
from multiple commit messages.


http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@275
PS1, Line 275: 1
> Why is only one row group filtered out here and not 2?
there are two files in the table. one of them is dictionary encoded for int_map 
key (so its filtered) and the other is plain encoded (so is not filtered). when 
I look at the plain encoded one, I see only one value, so perhaps the idea is 
not to use dictionary encoding when its not worth it?


http://gerrit.cloudera.org:8080/#/c/8775/1/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@374
PS1, Line 374: # Array struct value at depth 3. Bottom not required. No 
matches. Multiple nested values projected.
> nit: long lines
Done



--
To view, visit http://gerrit.cloudera.org:8080/8775
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd
Gerrit-Change-Number: 8775
Gerrit-PatchSet: 1
Gerrit-Owner: Vuk Ercegovac <vercego...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>
Gerrit-Comment-Date: Mon, 11 Dec 2017 20:47:53 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4993: extend dictionary filtering to collections

Reply via email to