[Impala-ASF-CR] IMPALA-4993: extend dictionary filtering to collections

Alex Behm (Code Review) Wed, 10 Jan 2018 21:47:50 -0800

Alex Behm has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/8775 )


Change subject: IMPALA-4993: extend dictionary filtering to collections
......................................................................


Patch Set 8:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.h@420
PS8, Line 420:   /// dictionary filtering. The definition/repetition levels are 
are populated by the
are are


http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@778
PS8, Line 778:     }
Did you intend to add the collection_reader to non_filtering?

I think deeply nested CollectionColumnReaders might never be added to 
non_dict_filterable_readers_

Maybe that's ok. Might be easier to completely ignore collection column readers 
since those are unconditionally never filterable. Might also be harder, not 
sure, your call.


http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@809
PS8, Line 809:   if (!state_->query_options().parquet_dictionary_filtering || 
dict_filter_map_.empty()) {
Looks like non_dict_filterable_readers_ is inconsistently populated. With 
dictionary filtering enabled it would also include all nested column readers 
that are not filterable, but with dictionary filtering disabled it only 
includes the top-level column readers. I assume that's not really a problem, 
but still seems inconsistent.


http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-parquet-scanner.cc@845
PS8, Line 845:   if (tuple_desc == nullptr) tuple_desc = 
scan_node_->tuple_desc();
Why this check? I believe all SlotDescriptors require a parent. We even DHCECK 
that in the SlotDescriptor c'tor.

If so, we can just remove this function


http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-scanner.h
File be/src/exec/hdfs-scanner.h:

http://gerrit.cloudera.org:8080/#/c/8775/8/be/src/exec/hdfs-scanner.h@222
PS8, Line 222:   typedef std::map<std::pair<TupleId, SlotId>, 
std::vector<ScalarExprEvaluator*>>
Why is the key a pair and not just SlotId or a SlotDescriptor*?


http://gerrit.cloudera.org:8080/#/c/8775/8/common/thrift/PlanNodes.thrift
File common/thrift/PlanNodes.thrift:

http://gerrit.cloudera.org:8080/#/c/8775/8/common/thrift/PlanNodes.thrift@245
PS8, Line 245:   9: optional list<TDictFilter> dictionary_filter_conjuncts
Not sure I understand why this needed to change. SlotIds are unique, so what 
benefit do we get from adding the tuple id? The TupleId can be derived from the 
SlotId through the DescriptorTbl.


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@115
PS8, Line 115:  * each value present in the row group's dictionary
... prune a row group by evaluating the conjuncts against the respective column 
dictionaries.

(there's no such thing as "the row group's dictionary" right?)


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@200
PS8, Line 200:   private Map<Pair<TupleDescriptor, SlotId>, List<Integer>> 
dictionaryFilterConjuncts_ =
Unfortunate that we have to use a Pair as key since conceptually the pair has 
redundant information (slot and tuple are 1:1)

Maybe a SlotDescriptor would work as key instead?


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@543
PS8, Line 543:    * These filters are added by analysis (see: 
SelectStmt#registerIsNotEmptyPredicates).
Move this comment to notEmptyCollections_


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@548
PS8, Line 548:   private boolean isCollectionNotEmpty(TupleDescriptor desc) {
Why do we need this function? Can't we just check 
notEmptyCollections_.contains(desc) inline?


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@638
PS8, Line 638:   private void addDictionaryFilter(Analyzer analyzer, Expr 
conjunct, int conjunct_idx) {
conjunctIdx


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@641
PS8, Line 641:
remove newline (these 3 lines kind of belong together)


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@675
PS8, Line 675:     Pair<TupleDescriptor, SlotId> slotKey =
If you agree to simplify the map key to SlotDescriptor, then I think we can 
remove tupleIds from this function entirely. The Preconditions check in L647 is 
not that useful because that's a fundamental guarantee of our design.


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@677
PS8, Line 677:     if (dictionaryFilterConjuncts_.containsKey(slotKey)) {
We typically use this more idiomatic pattern to avoid duplicate lookups (same 
LOC):

List<Integer> slotList = dictionaryFilterConjuncts_.get(slotKey);
if (slotList == null) {
  slotList = Lists.newArrayList();
  dictionaryFilterConjuncts_.put(slotKey, slotList);
}
slotList.add(conjunctIdx);


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@690
PS8, Line 690:     for (int conjunct_idx = 0; conjunct_idx < conjuncts_.size(); 
++conjunct_idx) {
conjunctIdx here and below


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1039
PS8, Line 1039:       f.setTuple_id(entry.getKey().first.getId().asInt());
The tuple id redundant. Can we avoid sending it?


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1102
PS8, Line 1102:       // Groups the dictionary filterable conjuncts by tuple 
descriptor.
Let's add this into a helper function.


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1116
PS8, Line 1116:         List<Integer> totalIdxList = 
Lists.newArrayList(entry.getValue());
why a new arraylist?


http://gerrit.cloudera.org:8080/#/c/8775/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java@1121
PS8, Line 1121:         List<Expr> exprList = Lists.newArrayList();
move closer to where it's used (before L1132)


http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File 
testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@330
PS8, Line 330:    parquet statistics predicates: c_custkey > 0, o.o_orderkey > 
0, l.l_partkey > 0
inconsistent explain-plan printing: the stats predicates are all clumped 
together, but the dictionary predicates are grouped by tuple

Ok to defer the fix if you prefer, but I think we should address the 
inconsistency.


http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test@385
PS8, Line 385: l.l_receiptdate = '1994-08-24' and l.l_shipmode = 'RAIL' and 
l.l_returnflag = 'R'
also add a predicate that is true when the slot is NULL


http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

http://gerrit.cloudera.org:8080/#/c/8775/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test@272
PS8, Line 272: select id from functional_parquet.complextypestbl f, f.int_map m 
where m.key = 'k5'
If not already tested elsewhere, it might be worth adding an equivalent test 
like this just to make sure it works:

select id from functional_parquet.complextypestbl.int_map m where m.key = 'k5'



--
To view, visit http://gerrit.cloudera.org:8080/8775
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: If3a2abcfc3d0f7d18756816659fed77ce12668dd
Gerrit-Change-Number: 8775
Gerrit-PatchSet: 8
Gerrit-Owner: Vuk Ercegovac <vercego...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Vuk Ercegovac <vercego...@cloudera.com>
Gerrit-Comment-Date: Thu, 11 Jan 2018 05:47:27 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-4993: extend dictionary filtering to collections

Reply via email to