Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )

Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 7:

(11 comments)

Spent some time trying to understand it a bit better. I'm also playing around 
with it a bit and running a single node benchmark.

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@773
PS7, Line 773: end
nit: ends


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@798
PS7, Line 798:       return Status(Substitute("No offset index for '$0'",
We're assuming that a valid Parquet file has an offset index if it has a column 
index (it doesn't make much sense otherwise)? And therefore its OK to fail 
scanning if we find such a file?

I think this makes sense, but it would help readers to be explicit about that 
here.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@817
PS7, Line 817:     COUNTER_ADD(num_stats_filtered_pages_counter_,
nit: it's weird that filtered pages means "eliminated" pages in the counter but 
surviving pages in the filtered_pages_count variable.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@512
PS7, Line 512:         !num_buffered_values_ == 0 &&
This expression is weird - is the ! intended?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@655
PS7, Line 655: candidate_data_pages_.empty()
I think an inline function might make the logic easier to follow. E.g. 
PageFilteringEnabled(). When reading I find myself having to negate it in my 
head each time (i.e. "empty means that filtering is enabled").


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1105
PS7, Line 1105:     }
nit: else if on some line.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1109
PS7, Line 1109:       // is actually the location of the dictionary page.
parquet-mr wrote things this way right? Might be worth mentioning for context?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1608
PS7, Line 1608:     if (!SkipTopLevelRows(skip_rows)) return Status("Couldn't 
skip rows.");
I guess we don't expect this to occur, but in case it does, we should include 
the file name to allow debugging the error.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@96
PS7, Line 96:   for (auto& skip_range : skip_ranges) {
It might be worth validating the sortedness precondition here by tracking the 
previous range start.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@123
PS7, Line 123:   vector<int> filtered_pages;
filtered_pages is ambiguous. Maybe surviving_pages? remaining_pages? 
candidate_pages?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-page-index.h
File be/src/exec/parquet/parquet-page-index.h:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-page-index.h@60
PS7, Line 60: the
to



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Michael Ho <[email protected]>
Gerrit-Reviewer: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 20 Mar 2019 01:20:54 +0000
Gerrit-HasComments: Yes

Reply via email to