Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 8: (61 comments) Thanks for the comments! http://gerrit.cloudera.org:8080/#/c/12065/7//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12065/7//COMMIT_MSG@31 PS7, Line 31: added > nit: double word Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.h@424 PS7, Line 424: /// Pool to copy dictionary page buffer into. This pool is shared across all the > This is actually confusing because at line 638 of the .cc, this being empty I made 'page_filtering_' an output parameter and extended the comment. We use 'candidate_ranges_' at various places, even in the column readers, so it makes sense to keep it as a member. Adding a member variable as an output parameter to a member function would be a bit strange maybe. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.h@427 PS7, Line 427: > This variable is only used in the NextRowGroup()->ProcessPageIndex()->Evalu Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@119 PS7, Line 119: num_stats_filtered_row_groups_by_page_index_counter_ = > I think we should probably track the number of row groups with page indices Good idea! Added counter for it. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@694 PS7, Line 694: > nit: single line? Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@706 PS7, Line 706: column_index.max_values[page_idx], slot); > I think a typedef or even a struct might make these easier to read, and you Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@708 PS7, Line 708: DCHECK(false) << "Unsupported function name for statistics evaluation: " > I had a look at this and the control flow is different enough that it would Thanks for the tips! http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@716 PS7, Line 716: single_process_page_index_timer.Start(); > Nit: Can combine the declarations of pos_field and missing_field onto one l Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@719 PS7, Line 719: for (auto& scalar_reader : scalar_readers_) scalar_reader->ResetPageFiltering(); > Any reason to handle pos_field=true differently from EvaluateStatsConjuncts I think we can 'continue' in this loop when pos_field=true since there will be no statistics about 'pos' in the column index. So I added 'pos_field' to the next 'if'. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@721 PS7, Line 721: RETURN_IF_ERROR(CalculateCandidatePagesForColumns()); > Unnecessary blank line. Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@724 PS7, Line 724: return Status::OK(); > We could improve compactness of code by just inlining col_idx I use it at couple of places and somehow I find it slightly more readable with this variable. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@734 PS7, Line 734: > 733-735 could be a single ternary statement I use a ternary but I kept 'col_orders' because it made the expression shorter. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@736 PS7, Line 736: SchemaNode* node = nullptr; > 736-739 could maybe be a helper function Created helper 'CreateColumnStatsReader()' and use it 'EvaluateStatsConjuncts()' as well. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@747 PS7, Line 747: parquet::ColumnIndex column_index; > 747-762 could probably be a static helper function, e.g. Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@772 PS7, Line 772: &scalar_reader->offset_index_)); > This comment is close to just restating the logic below. Could omit it or e Deleted this comment. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@773 PS7, Line 773: row > nit: ends Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@780 PS7, Line 780: *filter_pages = true; > I feel like the logic from 766-780 could be condensed, e.g. using a ternary Created a helper function in parquet-common http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@789 PS7, Line 789: // for all columns. > This col_idx could also be inlined. Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@791 PS7, Line 791: "file '$1'", scalar_reader->schema_element().name, filename())); > Can it be <0? Otherwise be explicit about >0 or this could have weird conse Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@798 PS7, Line 798: > We're assuming that a valid Parquet file has an offset index if it has a co Extended the error message http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@803 PS7, Line 803: nReader* scalar_reader : > Should this method maybe also do the sorting? Then it's contract becomes sl I only did it this way to not make the signature awkward. If I change 'skip_ranges', then I can't take it as a const-ref, but as an output parameter, i.e. a pointer to it at the end of the parameter list. If I return the candidate ranges as an output parameter, then the syntax and callsite would a bit more weird. But I'm OK with it if you like it better. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@813 PS7, Line 813: > Can you use a setter for this, or (ideally, if possible) compute those befo Now that ComputeCandidatePages() has an output parameter, setting scalar_reader->candidate_data_pages_ would need some boilerplate code if it had a setter. I extended the comment for candidate_data_pages_. I think the worst thing about it right now is the we set these members before calling InitScalarColumns() which calls BaseScalarColumnReader::Reset(). For now I'll leave it as it is but thinking about how to improve the control flow. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@817 PS7, Line 817: DCHECK(row_batch != nullptr); > nit: it's weird that filtered pages means "eliminated" pages in the counter Yeah originaly I used "filtered" everywhere, but then I realised it has the opposite meaning for the profile counters, so I switched to "candidate". Unfortunately some leftover "filtered" still remained. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@1150 PS7, Line 1150: K() > nit: spell out? Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@1154 PS7, Line 1154: DCHECK(dst_batch != nullptr); > Include filename in case we need to debug? Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@380 PS7, Line 380: /// we issue a scan-range with sub-ranges that belong to the candidate data pages, i.e. > I had to look for it, maybe mention where this is initialized? Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@463 PS7, Line 463: > Does the different between primitive values and rows come from nested types Improved the comment http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@466 PS7, Line 466: > Add a comment what this returns. Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@467 PS7, Line 467: s how many > This is limited to the page, right? If so, maybe rename to SkipValuesInPage Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@470 PS7, Line 470: _rows' is 10, and ever > This reads like it gets the something like "the first row is at index 10 in Reformulated the sentence, hope it's better now. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@491 PS7, Line 491: /// range. For nested columns, it returns 0 when we are processing values from the last > Could we move this implementation out of the header? I think generally only Moved the bigger function definitions to the .cc http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@518 PS7, Line 518: int FillPositionsInCandidateRange(int rows_remaining, int max_values, > Are "filtered range", "row range", and "filtered row range" below all the s Yeah, and unfortunately I shouldn't use 'filtered' anymore. I tried to consolidate the terminologies. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@521 PS7, Line 521: /// Advance to the next candidate range that contains 'current_row_'. > DCHECK(!candidate_data_pages_.empty()) since this doesn't make sense unless Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@529 PS7, Line 529: bool SkipRowsInPage(); > DCHECK(!candidate_data_pages_.empty()) since this doesn't make sense unless I agree about renaming, but CurrentCandidateRangeOverlapsCurrentPage() is misleading because the position of 'current_row_' also counts. If there is an overlap between the range and the current page, but 'current_row_' already passed the range (possible for the last range), then this function needs to return false. How about CandidateRowsRemainingInCurrentPage()? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@529 PS7, Line 529: bool SkipRowsInPage(); > I think we should also DCHECK for out-of-range here and above. Added DCHECKs for 'current_row_range_' here and above. Added DCHECK for "'current_row_' is outside of current range" above. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@391 PS7, Line 391: int64_t num_values) { > This doesn't work for plain encoded strings because they don't have an enco I have a specialization for ParquetPlainEncoder::EncodedLen<parquet::Type::BYTE_ARRAY> in parquet-common.h http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@510 PS7, Line 510: // due to page filtering. > Is there some reason not to do this at the top of the loop? It feels like t I've put it here because at that point we surely know that we processed the values belonging to the def and rep levels. However, I think this would also work at the beginning of the loop. The tests pass and I couldn't come up with an edge case where it'd cause a problem yet. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@512 PS7, Line 512: if (IsLastCandidateRange()) { > This expression is weird - is the ! intended? Yeah, num_buffered_values_ == 0 is properly handled by NextPage(), and here it guards against peeking the rep levels when there aren't any levels anymore. Strangely PeekLevel() still returned 0 for me even when I consumed all the levels, but I suspect it was only some 0-padding at the end. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@517 PS7, Line 517: AdvanceCandidateRange(); : if (CandidateRowsRemainingInCurrentPage()) { : if(!SkipRowsInPage()) return false; : } else { : if (!JumpToNextPage()) return false; : } : } : if (SHOULD_TRIGGER_COL_READER_DEBUG_ACTION(val_count)) { : continue_execution &= ColReaderDebugAction(&val_count); : } > This part mostly uses global state, so it could go into its own method. I've put the first half into its own method. I think the 'else if' part has more sense here because it refers to 'val_count' and 'num_values'. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@555 PS7, Line 555: rep_level = 0; > I wonder if the code was more clear if we replace this with a more descript Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@655 PS7, Line 655: _consume, tuple_size, tuple_m > I think an inline function might make the logic easier to follow. E.g. Page I use DoesPageFiltering() because we might not doing page filtering even if it is enabled by the relevant query option. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1097 PS7, Line 1097: if (file_version.application == "parquet-mr" && file_version.VersionLt(1, 2, 9)) { > Should sub_ranges be built in a separate method? Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1105 PS7, Line 1105: } > nit: else if on some line. Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1109 PS7, Line 1109: if (!col_chunk.file_path.empty() && col_chunk.file_path != filename()) { > parquet-mr wrote things this way right? Might be worth mentioning for conte Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1514 PS7, Line 1514: > nit: remove some of the empty lines The pattern seems to be here is comment, then RETURN_IF_ERROR, then blank line. I removed the blank line before the break statement below. Should I remove more? http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1601 PS7, Line 1601: int64_t range_start = candidate_row_ranges[current_row_range_].first; > Isn't this equivalent to current_row_ >= ? Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1608 PS7, Line 1608: } > I guess we don't expect this to occur, but in case it does, we should inclu Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1653 PS7, Line 1653: level_ = def_levels_.CacheGetNext(); > Could this be the while condition? Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@55 PS7, Line 55: the codec > I think I'd prefer "compute" over "calculate" for things that are more than Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@55 PS7, Line 55: validate that the codec is a supported one, otherwise this will DCHECK. : THdfsCompression:: > I feel that sentence could go to the implementation. It's not really releva It's rather a declarative description of what this function does, actually it computes its result differently. I removed the comment not to cause any confusion, and added new comments to the implementation. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@58 PS7, Line 58: /// Return the Parquet code for the give > nit: by convention we usually return non-podts through pointers. Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@61 PS7, Line 61: dec); > Does this mean "page number" as in "we need to read the 5th and 8th pages"? Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@64 PS7, Line 64: /// offset index > nit: return through pointer Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@95 PS7, Line 95: & page_l > Maybe add a comment to explain what this tracks? Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@96 PS7, Line 96: DCHECK_LT(page_idx, page_locations.size()); > I think we could also move the sorting here, or add a DCHECK using std::is_ Added DCHECK. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@96 PS7, Line 96: DCHECK_LT(page_idx, page_locations.size()); > It might be worth validating the sortedness precondition here by tracking t Added a DCHECK(std::is_sorted), this might be less efficient in debug builds, but at least doesn't pollute the core logic. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@100 PS7, Line 100: > Are these inclusive? Generally I feel that the terms [begin,end) signal hal Thanks for the advice. I switched to a struct with members 'first' and 'last'. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@118 PS7, Line 118: // We found a gap in 'skip_ranges', i.e. a row range that is not covered by > bottom and top also could be replaced by first / last, but only if you get Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@123 PS7, Line 123: } > filtered_pages is ambiguous. Maybe surviving_pages? remaining_pages? candid Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@123 PS7, Line 123: } > I'm in favor of surviving or remaining so it's clear that they're not candi I take it as a parameter now, and call it 'candidate_pages'. I didn't want to introduce another term like 'surviving' or 'remaining'. However, I can switch to 'surviving' and 'remaining' at all places if you want. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-page-index.h File be/src/exec/parquet/parquet-page-index.h: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-page-index.h@60 PS7, Line 60: to > to Done -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Fri, 29 Mar 2019 16:18:55 +0000 Gerrit-HasComments: Yes