Lars Volker has posted comments on this change. (
http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................
Patch Set 7:
(31 comments)
Did a first pass over the main {h,cc} files and left some comments
http://gerrit.cloudera.org:8080/#/c/12065/7//COMMIT_MSG
Commit Message:
http://gerrit.cloudera.org:8080/#/c/12065/7//COMMIT_MSG@31
PS7, Line 31: added
nit: double word
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@694
PS7, Line 694: for (auto& scalar_reader : scalar_readers_) {
nit: single line?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@706
PS7, Line 706: vector<pair<int64_t, int64_t>> skip_ranges;
I think a typedef or even a struct might make these easier to read, and you
could use a single tuple instead of begin_row and end_row below.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@791
PS7, Line 791: if (col_chunk.offset_index_length != 0) {
Can it be <0? Otherwise be explicit about >0 or this could have weird
consequences for a broken file (e.g. in our fuzz testing).
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@803
PS7, Line 803: CalculateCandidateRanges
Should this method maybe also do the sorting? Then it's contract becomes
slightly simpler (i.e., input can be unsorted).
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@813
PS7, Line 813: d
Can you use a setter for this, or (ideally, if possible) compute those before
constructing the page readers and pass them into the c'tor? I looked through
the column readers to see where they are initialized before resorting to grep
to come here.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@1150
PS7, Line 1150: tlr
nit: spell out?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@380
PS7, Line 380: /// we will not even see the bytes of the filtered out pages.
I had to look for it, maybe mention where this is initialized?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@463
PS7, Line 463: /// invokes SkipValues().
Does the different between primitive values and rows come from nested types? If
so, maybe mention in a small comment?
Please also add a comment what this returns?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@466
PS7, Line 466: /// Skip values in the page data.
Add a comment what this returns.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@467
PS7, Line 467: SkipValues
This is limited to the page, right? If so, maybe rename to SkipValuesInPage()?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@470
PS7, Line 470: index of the first row
This reads like it gets the something like "the first row is at index 10 in the
page" where I think it should return "this page starts with row 23". Can you
clarify the comment?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@491
PS7, Line 491: int FillPositionsInFilteredRange(int rows_remaining, int
max_values,
Could we move this implementation out of the header? I think generally only
one-liners (full definition) should be here.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@518
PS7, Line 518: /// Advance to the next filtered range that contains
'current_row_'.
Are "filtered range", "row range", and "filtered row range" below all the same
thing"
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@529
PS7, Line 529: auto current_range =
parent_->candidate_ranges_[current_row_range_];
> DCHECK(!candidate_data_pages_.empty()) since this doesn't make sense unless
I think we should also DCHECK for out-of-range here and above.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@391
PS7, Line 391: bool ScalarColumnReader<InternalType, PARQUET_TYPE,
MATERIALIZED>::SkipValues(
This doesn't work for plain encoded strings because they don't have an encoded
len, no?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@517
PS7, Line 517: if (NeedToSkipRowsInCurrentPage()) {
: auto current_range =
parent_->candidate_ranges_[current_row_range_];
: int64_t skip_rows = current_range.first - current_row_ -
1;
: DCHECK_GE(skip_rows, 0);
: if (!SkipTopLevelRows(skip_rows)) return false;
: } else if (candidate_page_idx_ ==
candidate_data_pages_.size() - 1) {
: *num_values = val_count;
: num_buffered_values_ = 0;
: return val_count > 0;
: }
This part mostly uses global state, so it could go into its own method.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@555
PS7, Line 555: !candidate_data_pages_.empty()
I wonder if the code was more clear if we replace this with a more descriptive
function, e.g. DoesPageFiltering() or so, especially in cases where we check
candidate_data_pages_ without then actually accessing them below the check?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1097
PS7, Line 1097: vector<ScanRange::SubRange> sub_ranges;
Should sub_ranges be built in a separate method?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1514
PS7, Line 1514:
nit: remove some of the empty lines
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1653
PS7, Line 1653: num_rows == 0 && rep_levels_.PeekLevel() == 0
Could this be the while condition?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@55
PS7, Line 55: calculates
I think I'd prefer "compute" over "calculate" for things that are more than a
single expression, here and elsewhere.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@55
PS7, Line 55: This function calculates the union of 'skip_ranges' and subtracts
it from
: /// [0, num_rows).
I feel that sentence could go to the implementation. It's not really relevant
to the caller how this method accomplishes what it does, but it helps
understand the implementation
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@58
PS7, Line 58: std::vector<std::pair<int64_t, int64_t>>
nit: by convention we usually return non-podts through pointers.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@61
PS7, Line 61: the page indexes
Does this mean "page number" as in "we need to read the 5th and 8th pages"?
Then we could say "computes the pages that intersect..." and explain how these
pages are returned.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@64
PS7, Line 64: std::vector<int>
nit: return through pointer
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@95
PS7, Line 95: skip_end
Maybe add a comment to explain what this tracks?
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@96
PS7, Line 96: for (auto& skip_range : skip_ranges) {
> It might be worth validating the sortedness precondition here by tracking t
I think we could also move the sorting here, or add a DCHECK using
std::is_sorted.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@100
PS7, Line 100: - 1
Are these inclusive? Generally I feel that the terms [begin,end) signal
half-open intervals, where [first, last] signal closed. I'm good with keeping
the names but it would probably be good to at least have comments that mention
how to interpret the ends of the intervals. If you want to switch to first/last
we should have a struct to avoid the confusion with pair<>::first.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@118
PS7, Line 118: int64_t higher_bottom = std::max(lhs.first, rhs.first);
bottom and top also could be replaced by first / last, but only if you get rid
of the pair. Given the subtleties around half-open vs closed, a struct
RowInterval with a comment somewhere might be good.
http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@123
PS7, Line 123: vector<int> filtered_pages;
> filtered_pages is ambiguous. Maybe surviving_pages? remaining_pages? candid
I'm in favor of surviving or remaining so it's clear that they're not
candidates for further pruning.
--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Fri, 22 Mar 2019 23:44:42 +0000
Gerrit-HasComments: Yes