Tim Armstrong has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 7: (19 comments) I spent a bit of time in the guts of the parquet code. I ran some benchmarks against the TPC-H data and didn't find any regressions. On the TPC-H data it's surprisingly hard to find queries that benefit from the column indices, because the data in most columns of lineitem is low-NDV and/or uniformly distributed, except for l_orderkey, which is totally ordered so the existing min/max stats are pretty effective. Actually, now that I think about it, I might have loaded data with an Impala version that doesn't write page indexes by default, so maybe that explains it (but I can't tell directly from the profile) http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.h@424 PS7, Line 424: std::vector<std::pair<int64_t, int64_t>> candidate_ranges_; This is actually confusing because at line 638 of the .cc, this being empty signals that all the rows were eliminated. I think (candidate_ranges_.empty() && page_filtering_) means that all pages were filtered out, but (candidate_ranges_.empty() && !page_filtering_) means that no pages were filtered out. I guess the comment is accurate once we're actually scanning the data, but it's misleading about the intermediate values of these variables during and after ProcessPageIndex() returns. I have a couple of ideas for how to clarify it. I'd be ok with either: 1. Mention here that if all rows were eliminated, then the row group is skipped immediately after evaluating the page index. 2. Make the data flow clearer for ProcessPageIndex() clearer by having candidate_ranges_ and page_filtering_ be explicit output arguments. Then only the caller of ProcessPageIndex() actually sets candidate_ranges_. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.h@427 PS7, Line 427: bool page_filtering_ = false; This variable is only used in the NextRowGroup()->ProcessPageIndex()->EvaluatePageIndex() call change, so could just be passed as an argument. It would be good to reduce the amount of state and make the data flow explicit. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@119 PS7, Line 119: num_stats_filtered_row_groups_by_page_index_counter_ = I think we should probably track the number of row groups with page indices. Otherwise we don't know whether filtering was ineffective because the indices were missing or because they weren't selective http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@708 PS7, Line 708: //TODO: Try to remove code duplications between this and I had a look at this and the control flow is different enough that it would be hard to combine them without it being hard to follow. There might be some opportunity to condense the code or factor out logical functions to make the core algorithm easier to understand. I have some suggestions along those lines below. Some of this is a matter of taste (and you have good taste IMO) so please take or leave as you see fit. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@716 PS7, Line 716: bool pos_field; Nit: Can combine the declarations of pos_field and missing_field onto one line. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@719 PS7, Line 719: schema_resolver_->ResolvePath(slot_desc->col_path(), &node, &pos_field, Any reason to handle pos_field=true differently from EvaluateStatsConjuncts()? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@721 PS7, Line 721: Unnecessary blank line. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@724 PS7, Line 724: int col_idx = node->col_idx; We could improve compactness of code by just inlining col_idx http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@734 PS7, Line 734: const parquet::ColumnOrder* col_order = nullptr; 733-735 could be a single ternary statement http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@736 PS7, Line 736: ColumnStatsReader stats_reader(col_chunk, col_type, col_order, *node->element); 736-739 could maybe be a helper function http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@747 PS7, Line 747: bool value_read = false; 747-762 could probably be a static helper function, e.g. bool ReadStatFromIndex(const parquet::ColumnIndex&, int page_idx, const string& fn_name, bool* is_null_page, void* slot) http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@772 PS7, Line 772: // For the last page in the RowGroup, the skip ends with the number of rows in This comment is close to just restating the logic below. Could omit it or explain the why. E.g. "The end row index for the page must be inferred from other metadata." http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@780 PS7, Line 780: skip_ranges.push_back({begin_row, end_row}); I feel like the logic from 766-780 could be condensed, e.g. using a ternary for end_row and maybe just inlining begin_row and eliminating the intermediate variable. OK as is but I feel like this function is a bit "noisy" and the core algorithm is obscured. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@789 PS7, Line 789: int col_idx = scalar_reader->col_idx(); This col_idx could also be inlined. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@1154 PS7, Line 1154: return Status("Top level rows aren't in sync during page filtering!"); Include filename in case we need to debug? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@521 PS7, Line 521: auto& filtered_row_ranges = parent_->candidate_ranges_; DCHECK(!candidate_data_pages_.empty()) since this doesn't make sense unless index filtering is enabled. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@529 PS7, Line 529: auto current_range = parent_->candidate_ranges_[current_row_range_]; DCHECK(!candidate_data_pages_.empty()) since this doesn't make sense unless index filtering is enabled. I feel like it might be clearer if this was CurrentCandidateRangeOverlapsCurrentPage() - that's describing more directly what it's computing, instead of what action the caller will take. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@510 PS7, Line 510: if (!candidate_data_pages_.empty() && Is there some reason not to do this at the top of the loop? It feels like this fits with the NextPage() logic and it might be good to have page advancement consolidated in one place. OK to ignore, but might be worth leaving a comment if there was some rationale for this. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1601 PS7, Line 1601: while (current_row_ + 1 > filtered_row_ranges[current_row_range_].second) { Isn't this equivalent to current_row_ >= ? -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Thu, 21 Mar 2019 00:44:08 +0000 Gerrit-HasComments: Yes