Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17478 )

Change subject: IMPALA-10709: Min/max filters should be enabled for joins on 
sorted columns in Parquet tables
......................................................................


Patch Set 30:

(19 comments)

Thanks for working on this, it looks great overall.
I'll need to do another round next week to better comprehend the algorithms and 
tests.

http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@a1219
PS30, Line 1219:
nit: Unintended line removal?


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1032
PS30, Line 1032: DCHECK
nit: DCHECK_EQ


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1032
PS30, Line 1032:         DCHECK(v1.length() == sizeof(bool));
               :         DCHECK(v2.length() == sizeof(bool));
               :         return (*reinterpret_cast<const bool*>(v1.data()))
               :             < (*reinterpret_cast<const bool*>(v2.data()));
To avoid code duplication, we could have a template function for this. And we 
could specialise that function to the data types that need special care.


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1059
PS30, Line 1059: return [](const string& v1, const string& v2) {
               :         return (*reinterpret_cast<const DateValue*>(v1.data()))
nit: Don't we want the DCHECKs here as well?


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1194
PS30, Line 1194: set
I wonder if we could use a bitset instead. Or vector<bool> which usually has a 
packed representation where a bool is stored in a single bit.


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1218
PS30, Line 1218:       vector<string> min_vals;
               :       vector<string> max_vals;
nit: you could use std::vector::reserve() to reserve capacity.


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1237
PS30, Line 1237: auto page_range
I think this copies the elements, you should probably use const auto&.


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1242
PS30, Line 1242: auto page_range
same as above


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1279
PS30, Line 1279:           stringstream ss;
               :           ss << "Page " << i << " is not in the skipped page 
ranges.";
               :           return Status(ss.str());
nit: For these we usually use Substitute(), e.g.:

 return Status(Substitute("Page $0 is not in the skipped page ranges.", i));


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1293
PS30, Line 1293:     stringstream ss;
               :     ss << "Page index set should be empty.";
               :     return Status(ss.str());
You could just simply

 return Status("Page index set should be empty.");


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1367
PS30, Line 1367: DCHECK
DCHECK_EQ


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1392
PS30, Line 1392: std::
no std:: qualifier is needed here


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/hdfs-parquet-scanner.cc@1394
PS30, Line 1394: std::
not std::qualifier is needed here


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-common.h@83
PS30, Line 83: inline
nit: A function defined within a class/struct definition is implicitly inline, 
no need for the keyword.


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-common.h@90
PS30, Line 90:
nit: could be const


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-page-index-test.cc
File be/src/exec/parquet/parquet-page-index-test.cc:

http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/exec/parquet/parquet-page-index-test.cc@109
PS30, Line 109: std::
nit: no need for std::


http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/util/min-max-filter.cc
File be/src/util/min-max-filter.cc:

http://gerrit.cloudera.org:8080/#/c/17478/30/be/src/util/min-max-filter.cc@1093
PS30, Line 1093:
nit: needs +2 indent


http://gerrit.cloudera.org:8080/#/c/17478/30/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java
File fe/src/main/java/org/apache/impala/catalog/HdfsTable.java:

http://gerrit.cloudera.org:8080/#/c/17478/30/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2777
PS30, Line 2777:       /*
               :        *  Get the names of all sort by columns (specified in 
the SORT BY clause in
               :        *  CREATE TABLE DDL) from TBLPROPERTIES.
               :        */
nit: I think in function definitions we usually just use // comments


http://gerrit.cloudera.org:8080/#/c/17478/30/fe/src/main/java/org/apache/impala/catalog/HdfsTable.java@2788
PS30, Line 2788: (
nit: no need for parenthesis



--
To view, visit http://gerrit.cloudera.org:8080/17478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Gerrit-Change-Number: 17478
Gerrit-PatchSet: 30
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 04 Jun 2021 15:50:46 +0000
Gerrit-HasComments: Yes

Reply via email to