Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17478 )

Change subject: IMPALA-10709: Min/max filters should be enabled for joins on 
sorted columns in Parquet tables
......................................................................


Patch Set 32:

(21 comments)

http://gerrit.cloudera.org:8080/#/c/17478/32//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/17478/32//COMMIT_MSG@25
PS32, Line 25: expessed
expressed


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.h@361
PS32, Line 361: at most two
I've found the comment message a bit hard to digest. Could you please elaborate 
a little? E.g. "we are returning page ranges that need to be skipped. Since the 
pages are sorted, and we are filtering them with min/max filters, the 
followings are possible:

 * return empty, i.e. no pages can be skipped
 * return ['start_page_idx', 'end_page_idx'], i.e. every page is skipped
 * return ['start_page_idx', K], i.e. a page range at the beginning is skipped
 * return [L, 'end_page_idx'], i.e. a page range at the end is skipped
 * return ['start_page_idx', K], [L, 'end_page_idx'], i.e. some pages at the 
middle are retained


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.h@364
PS32, Line 364: in the two arrays.
I've found 'in the two arrays' confusing here. Please either remove it add some 
explanation. It probably refers to the min_vals/max_vals arrays, but we are 
retaining pages so I don't really understand this sentence.


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.h@364
PS32, Line 364: ['start_page_idx', 'end_page_idx']
Please add comment that these pages are required to be non-null pages


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@a1059
PS32, Line 1059:
nit: we could keep the old indentation in this case


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1023
PS32, Line 1023: St
nit: formatting


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1025
PS32, Line 1025: CompareFunc GetCompareLessFunc() {
nit: probably we could mark this as 'inline'


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1059
PS32, Line 1059:       return [](const string& v1, const string& v2) {
               :         DCHECK(v1.length() == sizeof(StringValue));
               :         DCHECK(v2.length() == sizeof(StringValue));
               :         const StringValue* string_value1 =
               :             reinterpret_cast<const StringValue*>(v1.data());
               :         const StringValue* string_value2 =
               :             reinterpret_cast<const StringValue*>(v2.data());
               :         return string_value1->Compare(*string_value2) < 0;
               :       };
Why can't we use GetCompareLessFunc<StringValue>? StringValue has

 bool StringValue::operator<(const StringValue& other) const


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1209
PS32, Line 1209: (
nit: parenthesis not required


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1209
PS32, Line 1209: (
nit: parenthesis not required


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1239
PS32, Line 1239: Page $0 is not in the skipped page ranges.
nit: please add a bit more verbose error message, and mention the fast path <-> 
regular path inconsistency. You could also mention to turn off the fast code 
path as a workaround.


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1254
PS32, Line 1254: Page $0 is in the skipped page ranges.
nit: please make it a bit more verbose, see above.


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1330
PS32, Line 1330: start_page_idx > end_page_idx
start_page_idx and eng_page_idx define an inclusive range, right? So should we 
check the page if these are equal?


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1349
PS32, Line 1349: GetMax()
'filter_max'


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1349
PS32, Line 1349: In the
               :   // former case, move itor 'low' forward to the 1st element 
*low > GetMax().
upper_bound would tell that immediately, wouldn't it?

 auto it = std::upper_bound(begin, end, filter_max, compare_less);
 if (it != end) {
   int idx = it - begin + start_page_idx;
   skipped_ranges->emplace_back(PageRange(idx, end_page_idx));
 }


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1368
PS32, Line 1368: GetMin()
'filter_min'


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1369
PS32, Line 1369: move itor 'up' backwards to the 1st element *up < GetMin().
Then I think we should use lower_bound() here, to find the first 'max_vals' 
element not less than 'filter_min'. Then the previous element of it is less 
than 'filter_min', i.e. no need for the loop. E.g.:

 auto it = std::upper_bound(begin, end, filter_min, compare_less);
 if (it != end && it != begin) {
   int idx = it - begin + start_page_idx - 1;
   skipped_ranges->emplace_back(PageRange(start_page_idx, idx));
 }


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1372
PS32, Line 1372: end_page_idx
If the first phase was successful, we could reuse 'idx' here, i.e.:

 end = max_vals.begin() + idx;

If we switch the order of the two bound-search (as I mentioned in an other 
comment), the same will go for begin.


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1390
PS32, Line 1390:     // If the two ranges overlap, merge them into a single one.
               :     if (range0.overlap(range1)) {
Is it possible?


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1396
PS32, Line 1396:        // Else order the two ranges by putting the one with 
smaller first value first.
               :        if (range0.first > range1.first) {
               :          PageRange original_range0 = range0;
               :         (*skipped_ranges)[0] = range1;
               :         (*skipped_ranges)[1] = original_range0;
               :       }
I think currently it's always in reverse order because [idx, end_page_idx] 
first, then [start_page_idx, idx].

I think we should just swap the above order and we don't need this branch.


http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/parquet-page-index-test.cc
File be/src/exec/parquet/parquet-page-index-test.cc:

http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/parquet-page-index-test.cc@288
PS32, Line 288: }
Do we have a test where the column index is not present for a column, only the 
offset index? E.g. when float col contains NaN we don't write column index, 
only the offset index.



--
To view, visit http://gerrit.cloudera.org:8080/17478
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963
Gerrit-Change-Number: 17478
Gerrit-PatchSet: 32
Gerrit-Owner: Qifan Chen <[email protected]>
Gerrit-Reviewer: Aman Sinha <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Mon, 07 Jun 2021 14:20:04 +0000
Gerrit-HasComments: Yes

Reply via email to