Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17478 )
Change subject: IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in Parquet tables ...................................................................... Patch Set 30: (19 comments) Thanks for working on this, it looks great overall. I'll need to do another round next week to better comprehend the algorithms and tests. http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@a1219 PS30, Line 1219: nit: Unintended line removal? http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1032 PS30, Line 1032: DCHECK nit: DCHECK_EQ http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1032 PS30, Line 1032: DCHECK(v1.length() == sizeof(bool)); : DCHECK(v2.length() == sizeof(bool)); : return (*reinterpret_cast<const bool*>(v1.data())) : < (*reinterpret_cast<const bool*>(v2.data())); To avoid code duplication, we could have a template function for this. And we could specialise that function to the data types that need special care. http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1059 PS30, Line 1059: return [](const string& v1, const string& v2) { : return (*reinterpret_cast<const DateValue*>(v1.data())) nit: Don't we want the DCHECKs here as well? http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1194 PS30, Line 1194: set I wonder if we could use a bitset instead. Or vector<bool> which usually has a packed representation where a bool is stored in a single bit. http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1218 PS30, Line 1218: vector<string> min_vals; : vector<string> max_vals; nit: you could use std::vector::reserve() to reserve capacity. http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1237 PS30, Line 1237: auto page_range I think this copies the elements, you should probably use const auto&. http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1242 PS30, Line 1242: auto page_range same as above http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1279 PS30, Line 1279: stringstream ss; : ss << "Page " << i << " is not in the skipped page ranges."; : return Status(ss.str()); nit: For these we usually use Substitute(), e.g.: return Status(Substitute("Page $0 is not in the skipped page ranges.", i)); http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1293 PS30, Line 1293: stringstream ss; : ss << "Page index set should be empty."; : return Status(ss.str()); You could just simply return Status("Page index set should be empty."); http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1367 PS30, Line 1367: DCHECK DCHECK_EQ http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1392 PS30, Line 1392: std:: no std:: qualifier is needed here http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1394 PS30, Line 1394: std:: not std::qualifier is needed here http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-common.h@83 PS30, Line 83: inline nit: A function defined within a class/struct definition is implicitly inline, no need for the keyword. http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-common.h@90 PS30, Line 90: nit: could be const http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-page-index-test.cc File be/src/exec/parquet/parquet-page-index-test.cc: http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-page-index-test.cc@109 PS30, Line 109: std:: nit: no need for std:: http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/util/min-max-filter.cc File be/src/util/min-max-filter.cc: http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/util/min-max-filter.cc@1093 PS30, Line 1093: nit: needs +2 indent http://gerrit.cloudera.org:8080/#/c/17478/30/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java: http://gerrit.cloudera.org:8080/#/c/17478/30/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2777 PS30, Line 2777: /* : * Get the names of all sort by columns (specified in the SORT BY clause in : * CREATE TABLE DDL) from TBLPROPERTIES. : */ nit: I think in function definitions we usually just use // comments http://gerrit.cloudera.org:8080/#/c/17478/30/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2788 PS30, Line 2788: ( nit: no need for parenthesis -- To view, visit http://gerrit.cloudera.org:8080/17478 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963 Gerrit-Change-Number: 17478 Gerrit-PatchSet: 30 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Fri, 04 Jun 2021 15:50:46 +0000 Gerrit-HasComments: Yes
