[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Lars Volker (Code Review) Fri, 22 Mar 2019 16:45:25 -0700

Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )


Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 7:

(31 comments)

Did a first pass over the main {h,cc} files and left some comments

http://gerrit.cloudera.org:8080/#/c/12065/7//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12065/7//COMMIT_MSG@31
PS7, Line 31: added
nit: double word


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@694
PS7, Line 694:   for (auto& scalar_reader : scalar_readers_) {
nit: single line?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@706
PS7, Line 706:   vector<pair<int64_t, int64_t>> skip_ranges;
I think a typedef or even a struct might make these easier to read, and you 
could use a single tuple instead of begin_row and end_row below.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@791
PS7, Line 791:     if (col_chunk.offset_index_length != 0) {
Can it be <0? Otherwise be explicit about >0 or this could have weird 
consequences for a broken file (e.g. in our fuzz testing).


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@803
PS7, Line 803: CalculateCandidateRanges
Should this method maybe also do the sorting? Then it's contract becomes 
slightly simpler (i.e., input can be unsorted).


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@813
PS7, Line 813: d
Can you use a setter for this, or (ideally, if possible) compute those before 
constructing the page readers and pass them into the c'tor? I looked through 
the column readers to see where they are initialized before resorting to grep 
to come here.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@1150
PS7, Line 1150: tlr
nit: spell out?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@380
PS7, Line 380:   /// we will not even see the bytes of the filtered out pages.
I had to look for it, maybe mention where this is initialized?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@463
PS7, Line 463:   /// invokes SkipValues().
Does the different between primitive values and rows come from nested types? If 
so, maybe mention in a small comment?

Please also add a comment what this returns?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@466
PS7, Line 466:   /// Skip values in the page data.
Add a comment what this returns.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@467
PS7, Line 467: SkipValues
This is limited to the page, right? If so, maybe rename to SkipValuesInPage()?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@470
PS7, Line 470: index of the first row
This reads like it gets the something like "the first row is at index 10 in the 
page" where I think it should return "this page starts with row 23". Can you 
clarify the comment?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@491
PS7, Line 491:   int FillPositionsInFilteredRange(int rows_remaining, int 
max_values,
Could we move this implementation out of the header? I think generally only 
one-liners (full definition) should be here.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@518
PS7, Line 518:   /// Advance to the next filtered range that contains 
'current_row_'.
Are "filtered range", "row range", and "filtered row range" below all the same 
thing"


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@529
PS7, Line 529:     auto current_range = 
parent_->candidate_ranges_[current_row_range_];
> DCHECK(!candidate_data_pages_.empty()) since this doesn't make sense unless
I think we should also DCHECK for out-of-range here and above.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@391
PS7, Line 391: bool ScalarColumnReader<InternalType, PARQUET_TYPE, 
MATERIALIZED>::SkipValues(
This doesn't work for plain encoded strings because they don't have an encoded 
len, no?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@517
PS7, Line 517: if (NeedToSkipRowsInCurrentPage()) {
             :         auto current_range = 
parent_->candidate_ranges_[current_row_range_];
             :         int64_t skip_rows = current_range.first - current_row_ - 
1;
             :         DCHECK_GE(skip_rows, 0);
             :         if (!SkipTopLevelRows(skip_rows)) return false;
             :       } else if (candidate_page_idx_ == 
candidate_data_pages_.size() - 1) {
             :         *num_values = val_count;
             :         num_buffered_values_ = 0;
             :         return val_count > 0;
             :       }
This part mostly uses global state, so it could go into its own method.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@555
PS7, Line 555: !candidate_data_pages_.empty()
I wonder if the code was more clear if we replace this with a more descriptive 
function, e.g. DoesPageFiltering() or so, especially in cases where we check 
candidate_data_pages_ without then actually accessing them below the check?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1097
PS7, Line 1097:   vector<ScanRange::SubRange> sub_ranges;
Should sub_ranges be built in a separate method?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1514
PS7, Line 1514:
nit: remove some of the empty lines


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1653
PS7, Line 1653: num_rows == 0 && rep_levels_.PeekLevel() == 0
Could this be the while condition?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@55
PS7, Line 55: calculates
I think I'd prefer "compute" over "calculate" for things that are more than a 
single expression, here and elsewhere.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@55
PS7, Line 55: This function calculates the union of 'skip_ranges' and subtracts 
it from
            : /// [0, num_rows).
I feel that sentence could go to the implementation. It's not really relevant 
to the caller how this method accomplishes what it does,  but it helps 
understand the implementation


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@58
PS7, Line 58: std::vector<std::pair<int64_t, int64_t>>
nit: by convention we usually return non-podts through pointers.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@61
PS7, Line 61: the page indexes
Does this mean "page number" as in "we need to read the 5th and 8th pages"? 
Then we could say "computes the pages that intersect..." and explain how these 
pages are returned.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.h@64
PS7, Line 64: std::vector<int>
nit: return through pointer


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@95
PS7, Line 95: skip_end
Maybe add a comment to explain what this tracks?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@96
PS7, Line 96:   for (auto& skip_range : skip_ranges) {
> It might be worth validating the sortedness precondition here by tracking t
I think we could also move the sorting here, or add a DCHECK using 
std::is_sorted.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@100
PS7, Line 100:  - 1
Are these inclusive? Generally I feel that the terms [begin,end) signal 
half-open intervals, where [first, last] signal closed. I'm good with keeping 
the names but it would probably be good to at least have comments that mention 
how to interpret the ends of the intervals. If you want to switch to first/last 
we should have a struct to avoid the confusion with pair<>::first.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@118
PS7, Line 118:     int64_t higher_bottom = std::max(lhs.first, rhs.first);
bottom and top also could be replaced by first / last, but only if you get rid 
of the pair. Given the subtleties around half-open vs closed, a struct 
RowInterval with a comment somewhere might be good.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@123
PS7, Line 123:   vector<int> filtered_pages;
> filtered_pages is ambiguous. Maybe surviving_pages? remaining_pages? candid
I'm in favor of surviving or remaining so it's clear that they're not 
candidates for further pruning.



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Fri, 22 Mar 2019 23:44:42 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Reply via email to