Marcel Kornacker has posted comments on this change. Change subject: IMPALA-4624: Implement Parquet dictionary filtering ......................................................................
Patch Set 9: (17 comments) http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/hdfs-parquet-scanner.cc File be/src/exec/hdfs-parquet-scanner.cc: Line 71: // is guaranteed to be true for Impala versions of 2.9 or below. "versions 2.9" Line 599: // TODO: The values should be converted at dictionary read and store "during dictionary construction and stored" Line 600: // in converted form in the dictionary. let's tackle this particular todo immediately after the current patch goes in, having dictionary filtering for string columns is going to be very useful. Line 699: if (file_version_.application == "impala" && file_version_.VersionLt(2,10,0)) { this will need to change to 2.9 later, because 2.9 will support encoding_stats Line 706: BaseScalarColumnReader* scalar_reader = why not make this list contain BaseScalarColumnReader*? Line 715: // This reader is not 100% dictionary encoded, so dictionary filters i think it's more accurate to say "we cannot say with certainty whether the columns if 100% ..." http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/hdfs-parquet-scanner.h File be/src/exec/hdfs-parquet-scanner.h: Line 443: std::vector<int8_t> dict_filter_tuple_backing_; you can use ScopedBuffer here. take a look at https://gerrit.cloudera.org/#/c/6032/11/be/src/exec/hdfs-parquet-scanner.h Line 620: /// structures needed for evaluating dictionary filtering. point out which member vars get populated (= side effects) in other words, you could have also phrased this as "populates dict_filterable_readers_ and non_dict_filterable_readers_". http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/parquet-column-readers.cc File be/src/exec/parquet-column-readers.cc: Line 761: Status BaseScalarColumnReader::ReadPageHeader(bool peek, indentation Line 840: // is not a dictionary, then there is no dictionary. how does this interact with the dictionary_page_offset? do we ignore that? Line 849: if (UNLIKELY(data_size < 0)) { these checks aren't really related to dictionaries, can't we do that elsewhere? fine to leave todo. http://gerrit.cloudera.org:8080/#/c/5904/9/be/src/exec/parquet-column-readers.h File be/src/exec/parquet-column-readers.h: Line 366: // Initializes the dictionary, if it exists point out specific state that changes http://gerrit.cloudera.org:8080/#/c/5904/8/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 698: return output.toString(); > Added a comment. I'm sorting to display the predicates in the same order as but isn't the unsorted order the one in which they're being applied? http://gerrit.cloudera.org:8080/#/c/5904/9/fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java File fe/src/main/java/org/apache/impala/planner/HdfsScanNode.java: Line 332: * dictionaryFilterConjuncts_ map. you can leave it as-is, i don't mean to belabor the point, but putting this more concisely would also be fine: "Walks through conjuncts and populates dictionaryFilterConjuncts_." (you already described what that variable means elsewhere) Line 682: if (detailLevel.ordinal() >= TExplainLevel.VERBOSE.ordinal()) { i think the min/max predicates are output at extended level, might as well follow that. Line 693: for (Integer idx : totalIdxList) { exprList.add(conjuncts_.get(idx)); } remove {} http://gerrit.cloudera.org:8080/#/c/5904/9/testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test File testdata/workloads/functional-planner/queries/PlannerTest/parquet-filtering.test: Line 23: dictionary filter predicates: int_col > 1 lars called his "parquet statistics predicates', so let's call these 'parquet dictionary predicates' -- To view, visit http://gerrit.cloudera.org:8080/5904 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I3a7cc3bd0523fbf3c79bd924219e909ef671cfd7 Gerrit-PatchSet: 9 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Joe McDonnell <[email protected]> Gerrit-Reviewer: Alex Behm <[email protected]> Gerrit-Reviewer: Joe McDonnell <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Marcel Kornacker <[email protected]> Gerrit-Reviewer: Matthew Mulder <[email protected]> Gerrit-Reviewer: Mostafa Mokhtar <[email protected]> Gerrit-HasComments: Yes
