Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 15: (6 comments) I did a pass up to be/src/exec/parquet/parquet-page-index-test.cc of the changes, thanks for all the fixes, it is definitely simpler to follow http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@708 PS7, Line 708: *is_null_page = column_index.null_pages[page_idx]; > Thanks for the tips! Thanks for your patience with this, it is more readable now. http://gerrit.cloudera.org:8080/#/c/12065/15/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/15/be/src/exec/parquet/hdfs-parquet-scanner.cc@1170 PS15, Line 1170: // Scalar readers that read collections can be ahead of other readers by 1. :( This stuff got too complicated. http://gerrit.cloudera.org:8080/#/c/12065/15/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/15/be/src/exec/parquet/parquet-column-readers.h@533 PS15, Line 533: isn't anymore pedantry: aren't any more http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@529 PS7, Line 529: > Added DCHECKs for 'current_row_range_' here and above. Thanks, the new name is better than what I suggested. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@512 PS7, Line 512: if (IsLastCandidateRange()) { > Yeah, num_buffered_values_ == 0 is properly handled by NextPage(), and here Yeah exactly, the bit-packed runs in Parquet's RLE encoding can have spurious 0's at the end - it doesn't store the # of values in band. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@96 PS7, Line 96: // We found a gap in 'skip_ranges', i.e. a row range that is not covered by > Added a DCHECK(std::is_sorted), this might be less efficient in debug build Yeah that seems fine, although we avoided the problem anyway :). I think it's fine to have an O(n) DCHECK in an O(n) function. -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 15 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Pooja Nilangekar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Sat, 20 Apr 2019 01:12:41 +0000 Gerrit-HasComments: Yes
