Joe McDonnell has posted comments on this change.

Change subject: IMPALA-4624: Implement Parquet dictionary filtering
......................................................................
Patch Set 8:

(31 comments)

Addressed the review comments. I also added a query option "parquet_dictionary_filtering", which defaults to true.

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.cc
File be/src/exec/hdfs-parquet-scanner.cc:

Line 515: RETURN_IF_ERROR(InitColumns(row_group_idx_, column_readers_));
> good point. InitColumns() needs to be split up so that the construction of
I changed this so that we call InitColumns() for only the dictionary filtering columns before doing dictionary filtering. This means that the reads for the other columns only occur if the row group passes. To do better than this (i.e. to read only the dictionary), we would need to do what Marcel suggests and rework the code more substantially.

Line 532: RETURN_IF_ERROR(InitDictionaries(column_readers_));
> the control flow is hard to follow because it the top-level functions don't
Done

Line 607: // values are not validated, so skip these datatypes for now.
> also leave todo, this can be done during dictionary construction.
Done

Line 615: Status HdfsParquetScanner::EvalDictionaryFilters(int row_group_idx,
> indentation
Done

Line 628: for (ParquetColumnReader* col_reader : dict_filter_column_readers_) {
> there is a lot of code here that doesn't evaluate a dictionary. restructure
Done

Line 637: vector<ExprContext*>& slot_conjunct_ctxs = slot_conjunct_it->second;
> move to where variables are used.
Done

Line 701: // here or in ReadDataPage but never both.
> is that still true? i thought you expect the first page to be dictionary-en
Good point, corrected the comment. The dictionary page offset is used to set the start of the stream. Since the dictionary page offset is required to be less than the data page offset, the stream needs to start at the dictionary page offset.

Line 710: // TODO: handle the 40000 in a more elegant manner (make it a named constant)
> resolve todo (don't leave todos that take very little time to resolve).
Done, the 40000 is now a named constant.

Line 723: column_passes_filter = true;
> 'passes filter' is ambiguous (does this mean the predicate is true or false
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-scanner.h
File be/src/exec/hdfs-parquet-scanner.h:

Line 613: void InitDictionaryFilters();
> not intuitively named, i thought this had something to do with dictionary c
Changed the name.

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/hdfs-parquet-table-writer.cc
File be/src/exec/hdfs-parquet-table-writer.cc:

Line 244: unordered_set<Encoding::type> column_encodings_;
> missing comments
Done

Line 525: dict_encoding_stats_[dict_header.encoding]++;
> ++...
Done

Line 611: column_encodings_.insert(header.data_page_header.definition_level_encoding);
> is this useful/required? the def/rep level encodings don't vary.
Changed this to record the def/rep level encodings once at the start.

Line 1042: for (Encoding::type encoding : columns_[i]->column_encodings_) {
> you use columns_[i] a lot, create a variable
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.cc
File be/src/exec/parquet-column-readers.cc:

PS8, Line 462: NeedsConversionInline
> can this be another template parameter?
The callers of ReadSlot are on the fast path as well, so those call sites would need to know what value to pass into ReadSlot and would need the inlined version.

Line 764: // end of the stream.
> superfluous comment, the function comment in the header file already makes
Done

Line 804: new_buffer_size, &buffer, &new_buffer_size, &status, /* peek */ true);
> indentation
Done

Line 811: if (buffer_size == new_buffer_size) {
> what does this mean?
Added a comment.

Line 991: return Status("Column chunk should not contain two dictionary pages.");
> make this an actionable error message (ie, include the file name).
Done, the error message now includes the file name.

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/exec/parquet-column-readers.h
File be/src/exec/parquet-column-readers.h:

Line 431: /// an indication of whether a dictionary is present.
> so this is true if no dictionary exists?
Removed this.

http://gerrit.cloudera.org:8080/#/c/5904/8/be/src/util/dict-test.cc
File be/src/util/dict-test.cc:

Line 55: for (T i: values) {
> do a similar loop for GetValue()
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/common/thrift/parquet.thrift
File common/thrift/parquet.thrift:

Line 1: /**
> is this the unmodified parquet.thrift? (and it comes without a format versi
Yes, this is the file I found on GitHub. I fixed some whitespace, but there is no version information.

http://gerrit.cloudera.org:8080/#/c/5904/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java
File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java:

Line 190:
> avoid double-spaced code
Done

Line 332: * that are bound by only that slot. The single-slot restriction allows
> explicitly mention any member variables that get populated (= spell out sid
Done

Line 349: boolean isDeterministic = true;
> instead, add a new predicate to Expr.java (IS_DETERMINISTIC_FN_PREDICATE),
Added IS_NONDETERMINISTIC_FN_PREDICATE and reworked this code.

Line 359: // This is important for dictionary filtering. Dictionaries do not
> so these predicates aren't just 'single-slot conjuncts', because they're sp
Renamed throughout the code.

Line 692: if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) {
> precede w/ blank line
Done

Line 698: Collections.sort(totalIdxList);
> why sort these?
Added a comment. I'm sorting to display the predicates in the same order as the other predicate lists.
Line 700: for (Integer idx : totalIdxList) {
> single line
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test
File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test:

Line 23: single slot predicates: int_col > 1
> might as well call this something like 'dictionary predicates', to make exp
Done

http://gerrit.cloudera.org:8080/#/c/5904/8/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test:

Line 1: ====
> Please add coverage for nested data filtering as well.
Added a query that does dictionary filtering at the top level on tpch_nested_parquet.customer.

--
To view, visit http://gerrit.cloudera.org:8080/5904
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7
Gerrit-PatchSet: 8
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Alex Behm <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Marcel Kornacker <[email protected]>
Gerrit-Reviewer: Matthew Mulder <[email protected]>
Gerrit-Reviewer: Mostafa Mokhtar <[email protected]>
Gerrit-HasComments: Yes
