[Impala-ASF-CR] IMPALA-4993: extend dictionary filtering to collections

Vuk Ercegovac (Code Review) Thu, 11 Jan 2018 12:08:52 -0800

Vuk Ercegovac has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8775 )


Change subject: IMPALA-4993: extend dictionary filtering to collections
......................................................................


Patch Set 8:

(23 comments)

http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.h@420
PS8, Line 420:   /// dictionary filtering. The definition/repetition levels are 
are populated by the
> are are
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@778
PS8, Line 778:     }
> FWIW, I discovered today I needed to make some similar adjustments to the c
will take a look at your change more closely.


http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@778
PS8, Line 778:     }
> Did you intend to add the collection_reader to non_filtering?
no, I intended to omit the collection readers. this partitions all nested 
scalar readers.


http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@809
PS8, Line 809:   if (!state_->query_options().parquet_dictionary_filtering || 
dict_filter_map_.empty()) {
> Looks like non_dict_filterable_readers_ is inconsistently populated. With d
yup... I assumed that if its a mix of collection + scalars is ok since that's 
what works today (perhaps modified by tim's change). And I don't see a problem 
with flattening down to just the scalar readers.


http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@845
PS8, Line 845:   if (tuple_desc == nullptr) tuple_desc = 
scan_node_->tuple_desc();
> Why this check? I believe all SlotDescriptors require a parent. We even DHC
that's correct, removed. I think I carried that conservative check over from 
the frontend where the parent is not required to be non-null in the ctor. 
should it be required?


http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-scanner.h@222
PS8, Line 222:   typedef std::map<std::pair<TupleId, SlotId>, 
std::vector<ScalarExprEvaluator*>>
> Why is the key a pair and not just SlotId or a SlotDescriptor*?
I'll comment here on the <tupleid, slotid> vs slotid decision here (covers 
several related comments in the thrift file and hdfsscannode). I suspected that 
slotid is unique across tuples so with your other comments and various other 
implications baked into various apis, I've gone ahead and made this change. 
However, with a two-level naming scheme (tuple/slot), I was a bit unclear on 
this and did not find it stated explicitly. I would've expected this in 
frontend's TupleDescriptor-- should I add a clarifying blurb there?


http://gerrit.cloudera.org:8080/#/c/8775/8/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8775/8/common/thrift/PlanNodes.thrift@245
PS8, Line 245:   9: optional list<TDictFilter> dictionary_filter_conjuncts
> Not sure I understand why this needed to change. SlotIds are unique, so wha
Done. See comment in hdfs-scanner for why I went down this route.


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@115
PS8, Line 115:  * each value present in the row group's dictionary
> ... prune a row group by evaluating the conjuncts against the respective co
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@200
PS8, Line 200:   private Map<Pair<TupleDescriptor, SlotId>, List<Integer>> 
dictionaryFilterConjuncts_ =
> Unfortunate that we have to use a Pair as key since conceptually the pair h
Done. See comment in hdfs-scanner for why I went down this route.


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@543
PS8, Line 543:    * These filters are added by analysis (see: 
SelectStmt#registerIsNotEmptyPredicates).
> Move this comment to notEmptyCollections_
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@548
PS8, Line 548:   private boolean isCollectionNotEmpty(TupleDescriptor desc) {
> Why do we need this function? Can't we just check notEmptyCollections_.cont
its used in a couple of places and I wanted to keep the same checks in both 
(and not repeat the checks).


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@638
PS8, Line 638:   private void addDictionaryFilter(Analyzer analyzer, Expr 
conjunct, int conjunct_idx) {
> conjunctIdx
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@641
PS8, Line 641:
> remove newline (these 3 lines kind of belong together)
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@675
PS8, Line 675:     Pair<TupleDescriptor, SlotId> slotKey =
> If you agree to simplify the map key to SlotDescriptor, then I think we can
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@677
PS8, Line 677:     if (dictionaryFilterConjuncts_.containsKey(slotKey)) {
> We typically use this more idiomatic pattern to avoid duplicate lookups (sa
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@690
PS8, Line 690:     for (int conjunct_idx = 0; conjunct_idx < conjuncts_.size(); 
++conjunct_idx) {
> conjunctIdx here and below
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1039
PS8, Line 1039:       f.setTuple_id(entry.getKey().first.getId().asInt());
> The tuple id redundant. Can we avoid sending it?
Done. See comment in hdfs-scanner for why I went down this route.


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1102
PS8, Line 1102:       // Groups the dictionary filterable conjuncts by tuple 
descriptor.
> Let's add this into a helper function.
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1116
PS8, Line 1116:         List<Integer> totalIdxList = 
Lists.newArrayList(entry.getValue());
> why a new arraylist?
didn't want to mutate a class member for the purpose of printing diags.


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1121
PS8, Line 1121:         List<Expr> exprList = Lists.newArrayList();
> move closer to where it's used (before L1132)
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@330
PS8, Line 330:    parquet statistics predicates: c_custkey > 0, o.o_orderkey > 
0, l.l_partkey > 0
> inconsistent explain-plan printing: the stats predicates are all clumped to
hmm... actually I followed the pattern on L324-5 which separates the predicates 
by tuple.


http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@385
PS8, Line 385: l.l_receiptdate = '1994-08-24' and l.l_shipmode = 'RAIL' and 
l.l_returnflag = 'R'
> also add a predicate that is true when the slot is NULL
Done


http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@272
PS8, Line 272: select id from functional_parquet.complextypestbl f, f.int_map m 
where m.key = 'k5'
> If not already tested elsewhere, it might be worth adding an equivalent tes
Done



--
To view, visit http://gerrit.cloudera.org:8080/8775
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd
Gerrit-Change-Number: 8775
Gerrit-PatchSet: 8
Gerrit-Owner: Vuk Ercegovac <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Vuk Ercegovac <[email protected]>
Gerrit-Comment-Date: Thu, 11 Jan 2018 20:08:32 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4993: extend dictionary filtering to collections

Reply via email to