Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/17478 )
Change subject: IMPALA-10709: Min/max filters should be enabled for joins on sorted columns in Parquet tables ...................................................................... Patch Set 32: (21 comments) http://gerrit.cloudera.org:8080/#/c/17478/32//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/17478/32//COMMIT_MSG@25 PS32, Line 25: expessed expressed http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.h@361 PS32, Line 361: at most two I've found the comment message a bit hard to digest. Could you please elaborate a little? E.g. "we are returning page ranges that need to be skipped. Since the pages are sorted, and we are filtering them with min/max filters, the followings are possible: * return empty, i.e. no pages can be skipped * return ['start_page_idx', 'end_page_idx'], i.e. every page is skipped * return ['start_page_idx', K], i.e. a page range at the beginning is skipped * return [L, 'end_page_idx'], i.e. a page range at the end is skipped * return ['start_page_idx', K], [L, 'end_page_idx'], i.e. some pages at the middle are retained http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.h@364 PS32, Line 364: in the two arrays. I've found 'in the two arrays' confusing here. Please either remove it add some explanation. It probably refers to the min_vals/max_vals arrays, but we are retaining pages so I don't really understand this sentence. http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.h@364 PS32, Line 364: ['start_page_idx', 'end_page_idx'] Please add comment that these pages are required to be non-null pages http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@a1059 PS32, Line 1059: nit: we could keep the old indentation in this case http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1023 PS32, Line 1023: St nit: formatting http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1025 PS32, Line 1025: CompareFunc GetCompareLessFunc() { nit: probably we could mark this as 'inline' http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1059 PS32, Line 1059: return [](const string& v1, const string& v2) { : DCHECK(v1.length() == sizeof(StringValue)); : DCHECK(v2.length() == sizeof(StringValue)); : const StringValue* string_value1 = : reinterpret_cast<const StringValue*>(v1.data()); : const StringValue* string_value2 = : reinterpret_cast<const StringValue*>(v2.data()); : return string_value1->Compare(*string_value2) < 0; : }; Why can't we use GetCompareLessFunc<StringValue>? StringValue has bool StringValue::operator<(const StringValue& other) const http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1209 PS32, Line 1209: ( nit: parenthesis not required http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1209 PS32, Line 1209: ( nit: parenthesis not required http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1239 PS32, Line 1239: Page $0 is not in the skipped page ranges. nit: please add a bit more verbose error message, and mention the fast path <-> regular path inconsistency. You could also mention to turn off the fast code path as a workaround. http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1254 PS32, Line 1254: Page $0 is in the skipped page ranges. nit: please make it a bit more verbose, see above. http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1330 PS32, Line 1330: start_page_idx > end_page_idx start_page_idx and eng_page_idx define an inclusive range, right? So should we check the page if these are equal? http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1349 PS32, Line 1349: GetMax() 'filter_max' http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1349 PS32, Line 1349: In the : // former case, move itor 'low' forward to the 1st element *low > GetMax(). upper_bound would tell that immediately, wouldn't it? auto it = std::upper_bound(begin, end, filter_max, compare_less); if (it != end) { int idx = it - begin + start_page_idx; skipped_ranges->emplace_back(PageRange(idx, end_page_idx)); } http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1368 PS32, Line 1368: GetMin() 'filter_min' http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1369 PS32, Line 1369: move itor 'up' backwards to the 1st element *up < GetMin(). Then I think we should use lower_bound() here, to find the first 'max_vals' element not less than 'filter_min'. Then the previous element of it is less than 'filter_min', i.e. no need for the loop. E.g.: auto it = std::upper_bound(begin, end, filter_min, compare_less); if (it != end && it != begin) { int idx = it - begin + start_page_idx - 1; skipped_ranges->emplace_back(PageRange(start_page_idx, idx)); } http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1372 PS32, Line 1372: end_page_idx If the first phase was successful, we could reuse 'idx' here, i.e.: end = max_vals.begin() + idx; If we switch the order of the two bound-search (as I mentioned in an other comment), the same will go for begin. http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1390 PS32, Line 1390: // If the two ranges overlap, merge them into a single one. : if (range0.overlap(range1)) { Is it possible? http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/hdfs-parquet-scanner.cc@1396 PS32, Line 1396: // Else order the two ranges by putting the one with smaller first value first. : if (range0.first > range1.first) { : PageRange original_range0 = range0; : (*skipped_ranges)[0] = range1; : (*skipped_ranges)[1] = original_range0; : } I think currently it's always in reverse order because [idx, end_page_idx] first, then [start_page_idx, idx]. I think we should just swap the above order and we don't need this branch. http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/parquet-page-index-test.cc File be/src/exec/parquet/parquet-page-index-test.cc: http://gerrit.cloudera.org:8080/#/c/17478/32/be/src/exec/parquet/parquet-page-index-test.cc@288 PS32, Line 288: } Do we have a test where the column index is not present for a column, only the offset index? E.g. when float col contains NaN we don't write column index, only the offset index. -- To view, visit http://gerrit.cloudera.org:8080/17478 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I28c19c4b39b01ffa7d275fb245be85c28e9b2963 Gerrit-Change-Number: 17478 Gerrit-PatchSet: 32 Gerrit-Owner: Qifan Chen <[email protected]> Gerrit-Reviewer: Aman Sinha <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Qifan Chen <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Mon, 07 Jun 2021 14:20:04 +0000 Gerrit-HasComments: Yes
