Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................
Patch Set 8:

(31 comments)

Addressed the review comments. I also added a query option "parquet_dictionary_filtering", which defaults to true.

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 515: RETURN_IF_ERROR(InitColumns(row_group_idx_, column_readers_));
> good point. InitColumns() needs to be split up so that the construction of
I changed this so that we call InitColumns() for only the dictionary filtering columns before doing dictionary filtering. This means that the reads for the other columns only occur if the row group passes. To do better than this (i.e. to read only the dictionary), we would need to do what Marcel suggests and rework the code more substantially.

Line 532: RETURN_IF_ERROR(InitDictionaries(column_readers_));
> the control flow is hard to follow because it the top-level functions don't
Done

Line 607: // values are not validated, so skip these datatypes for now.
> also leave todo, this can be done during dictionary construction.
Done

Line 615: Status HdfsParquetScanner::EvalDictionaryFilters(int row_group_idx,
> indentation
Done

Line 628: for (ParquetColumnReader* col_reader : dict_filter_column_readers_) {
> there is a lot of code here that doesn't evaluate a dictionary. restructure
Done

Line 637: vector<ExprContext*>& slot_conjunct_ctxs = slot_conjunct_it->second;
> move to where variables are used.
Done

Line 701: // here or in ReadDataPage but never both.
> is that still true? i thought you expect the first page to be dictionary-en
Good point, corrected the comment. The dictionary page offset is used to set the start of the stream. Since the dictionary page offset is required to be less than the data page offset, the stream needs to start at the dictionary page offset.

Line 710: // TODO: handle the 40000 in a more elegant manner (make it a named constant)
> resolve todo (don't leave todos that take very little time to resolve).
Done, the 40000 is now a named constant.

Line 723: column_passes_filter = true;
> 'passes filter' is ambiguous (does this mean the predicate is true or false
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 613: void InitDictionaryFilters();
> not intuitively named, i thought this had something to do with dictionary c
Changed the name.

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 244: unordered_set<Encoding::type> column_encodings_;
> missing comments
Done

Line 525: dict_encoding_stats_[dict_header.encoding]++;
> ++...
Done

Line 611: column_encodings_.insert(header.data_page_header.definition_level_encoding);
> is this useful/required? the def/rep level encodings don't vary.
Changed this to record the def/rep level encodings once at the start.

Line 1042: for (Encoding::type encoding : columns_[i]->column_encodings_) {
> you use columns_[i] a lot, create a variable
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS8, Line 462: NeedsConversionInline
> can this be another template parameter?
The callers of ReadSlot are on the fast path as well, so those call sites would need to know what value to pass into ReadSlot and would need the inlined version.

Line 764: // end of the stream.
> superfluous comment, the function comment in the header file already makes
Done

Line 804: new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */ true);
> indentation
Done

Line 811: if (buffer_size == new_buffer_size) {
> what does this mean?
Added a comment.

Line 991: return Status("Column chunk should not contain two dictionary pages.");
> make this an actionable error message (ie, include the file name).
Done, the error message now includes the file name.

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 431: /// an indication of whether a dictionary is present.
> so this is true if no dictionary exists?
Removed this.

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/util/dict-test.cc
File be/src/util/dict-test.cc:

Line 55: for (T i: values) {
> do a similar loop for GetValue()
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 1: /**
> is this the unmodified parquet.thrift? (and it comes without a format versi
Yes, this is the file I found on GitHub. I fixed some whitespace, but there is no version information.

http://gerrit.cloudera.org:8080/#/c/5904/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 190:
> avoid double-spaced code
Done

Line 332: * that are bound by only that slot. The single-slot restriction allows
> explicitly mention any member variables that get populated (= spell out sid
Done

Line 349: boolean isDeterministic = true;
> instead, add a new predicate to Expr.java (IS_DETERMINISTIC_FN_PREDICATE),
Added IS_NONDETERMINISTIC_FN_PREDICATE and reworked this code.

Line 359: // This is important for dictionary filtering. Dictionaries do not
> so these predicates aren't just 'single-slot conjuncts', because they're sp
Renamed throughout the code.

Line 692: if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
> precede w/ blank line
Done

Line 698: Collections.sort(totalIdxList);
> why sort these?
Added a comment. I'm sorting to display the predicates in the same order as the other predicate lists.
Line 700: for (Integer idx : totalIdxList) {
> single line
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23: single slot predicates: int_col > 1
> might as well call this something like 'dictionary predicates', to make exp
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

Line 1: ====
> Please add coverage for nested data filtering as well.
Added a query that does dictionary filtering at the top level on tpch_nested_parquet.customer.

--
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Mulder <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-HasComments: Yes
