Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 7: (11 comments) Spent some time trying to understand it a bit better. I'm also playing around with it a bit and running a single node benchmark. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@773 PS7, Line 773: end nit: ends http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@798 PS7, Line 798: return Status(Substitute("No offset index for '$0'", We're assuming that a valid Parquet file has an offset index if it has a column index (it doesn't make much sense otherwise)? And therefore its OK to fail scanning if we find such a file? I think this makes sense, but it would help readers to be explicit about that here. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@817 PS7, Line 817: COUNTER_ADD(num_stats_filtered_pages_counter_, nit: it's weird that filtered pages means "eliminated" pages in the counter but surviving pages in the filtered_pages_count variable. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@512 PS7, Line 512: !num_buffered_values_ == 0 && This expression is weird - is the ! intended? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@655 PS7, Line 655: candidate_data_pages_.empty() I think an inline function might make the logic easier to follow. E.g. PageFilteringEnabled(). When reading I find myself having to negate it in my head each time (i.e. "empty means that filtering is enabled"). http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1105 PS7, Line 1105: } nit: else if on some line. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1109 PS7, Line 1109: // is actually the location of the dictionary page. parquet-mr wrote things this way right? Might be worth mentioning for context? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1608 PS7, Line 1608: if (!SkipTopLevelRows(skip_rows)) return Status("Couldn't skip rows."); I guess we don't expect this to occur, but in case it does, we should include the file name to allow debugging the error. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@96 PS7, Line 96: for (auto& skip_range : skip_ranges) { It might be worth validating the sortedness precondition here by tracking the previous range start. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@123 PS7, Line 123: vector<int> filtered_pages; filtered_pages is ambiguous. Maybe surviving_pages? remaining_pages? candidate_pages? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-page-index.h File be/src/exec/parquet/parquet-page-index.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-page-index.h@60 PS7, Line 60: the to -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Michael Ho <[email protected]> Gerrit-Reviewer: Pooja Nilangekar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Wed, 20 Mar 2019 01:20:54 +0000 Gerrit-HasComments: Yes
