Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 8: Code-Review+1 (12 comments) I have only minor comments, mainly about naming and line breaking. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@119 PS7, Line 119: num_stats_filtered_row_groups_by_page_index_counter_ = > Good idea! Added counter for it. I am not sure how to track this, but it is also possible that some columns have page indexes, while some do not. For example Parquet-mr does not write stats for int96 timestamps, and I think that it is also not written for double/float if there is a NaN value. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@117 PS8, Line 117: NumStatsFilteredRowGroups This name is not precise anymore, as it only means row groups filtered by column index. I am not sure how we should handle this, as keeping the old name is probably useful for compatibility reasons. A possibility is to count both cases in this stat, and have two separate stats for the two reasons. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@120 PS8, Line 120: NumStatsFilteredRowGroupsByPageIndex Maybe "completely" could be added to the name to make its meaning more exact. The fact that row groups that survived column chunk stats can still be completely eliminated by page filtering may not be obvious for the reader at first, and the current name could be interpreted as the number of row groups where page filtering was applied. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@517 PS8, Line 517: const parquet::ColumnOrder* col_order = col_idx < col_orders.size() ? : &col_orders[col_idx] : nullptr; nit: I prefer breaking after = http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@525 PS8, Line 525: ColumnStatsReader stat_reader = CreateColumnStatsReader(col_chunk, col_type, nit: I prefer breaking after = http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@752 PS8, Line 752: = nit: I prefer breaking after = http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@755 PS8, Line 755: col_order, *node->element); nit: I prefer breaking after = http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@773 PS8, Line 773: skip_ranges.push_back(GetRowRangeForPage(row_group, scalar_reader->offset_index_, nit: I would prefer breaking before GetRowRangeForPage http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.h@503 PS8, Line 503: IsLastCandidateRange I think that adding "InRowGroup" to the name would make the call site clearer. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.h@526 PS8, Line 526: CandidateRowsRemainingInCurrentPage The name mislead me a bit, I have assumed that it returns an integer. Maybe something like PageHasRemainingCandidateRows would be better? http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.cc@1732 PS8, Line 1732: num_values_read_ -= num_buffered_values_; This seems to be the only place where num_values_read_ is decreased, so it looks like num_values_read_ = total_number_of_values_in_row_group - number_of_values_in_completely_skipped_pages - number_of_values_skipped_at_the_end_of_pages. It seems that we only use this variable in cases that are irrelevant with PageFiltering(), so I would remove this subtraction, and add some comments to num_values_read_ about its role if there is no page filtering, and its irrelevance if there is page filtering. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@329 PS8, Line 329: str_len += sizeof(int32_t); : if (UNLIKELY(str_len < 0 || buffer_end - buffer < str_len)) return -1; I am not sure if this matters but str_len of -4 is accepted as OK. -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Wed, 03 Apr 2019 15:22:20 +0000 Gerrit-HasComments: Yes