[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Tim Armstrong (Code Review) Wed, 20 Mar 2019 18:00:59 -0700

Tim Armstrong has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )


Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 7:

(19 comments)

I spent a bit of time in the guts of the parquet code.

I ran some benchmarks against the TPC-H data and didn't find any regressions.

On the TPC-H data it's surprisingly hard to find queries that benefit from the 
column indices, because the data in most columns of lineitem is low-NDV and/or 
uniformly distributed, except for l_orderkey, which is totally ordered so the 
existing min/max stats are pretty effective.

Actually, now that I think about it, I might have loaded data with an Impala 
version that doesn't write page indexes by default, so maybe that explains it 
(but I can't tell directly from the profile)

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.h@424
PS7, Line 424:   std::vector<std::pair<int64_t, int64_t>> candidate_ranges_;
This is actually confusing because at line 638 of the .cc, this being empty 
signals that all the rows were eliminated.

I think (candidate_ranges_.empty() && page_filtering_) means that all pages 
were filtered out, but (candidate_ranges_.empty() && !page_filtering_) means 
that no pages were filtered out. I guess the comment is accurate once we're 
actually scanning the data, but it's misleading about the intermediate values 
of these variables during and after ProcessPageIndex() returns.

I have a couple of ideas for how to clarify it. I'd be ok with either:
1. Mention here that if all rows were eliminated, then the row group is skipped 
immediately after evaluating the page index.
2. Make the data flow clearer for ProcessPageIndex() clearer by having 
candidate_ranges_ and page_filtering_ be explicit output arguments. Then only 
the caller of ProcessPageIndex() actually sets candidate_ranges_.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.h@427
PS7, Line 427:   bool page_filtering_ = false;
This variable is only used in the 
NextRowGroup()->ProcessPageIndex()->EvaluatePageIndex() call change, so could 
just be passed as an argument. It would be good to reduce the amount of state 
and make the data flow explicit.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@119
PS7, Line 119:   num_stats_filtered_row_groups_by_page_index_counter_ =
I think we should probably track the number of row groups with page indices. 
Otherwise we don't know whether filtering was ineffective because the indices 
were missing or because they weren't selective


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@708
PS7, Line 708:   //TODO: Try to remove code duplications between this and
I had a look at this and the control flow is different enough that it would be 
hard to combine them without it being hard to follow.

There might be some opportunity to condense the code or factor out logical 
functions to make the core algorithm easier to understand. I have some 
suggestions along those lines below. Some of this is a matter of taste (and you 
have good taste IMO) so please take or leave as you see fit.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@716
PS7, Line 716:     bool pos_field;
Nit: Can combine the declarations of pos_field and missing_field onto one line.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@719
PS7, Line 719:         schema_resolver_->ResolvePath(slot_desc->col_path(), 
&node, &pos_field,
Any reason to handle pos_field=true differently from EvaluateStatsConjuncts()?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@721
PS7, Line 721:
Unnecessary blank line.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@724
PS7, Line 724:     int col_idx = node->col_idx;
We could improve compactness of code by just inlining col_idx


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@734
PS7, Line 734:     const parquet::ColumnOrder* col_order = nullptr;
733-735 could be a single ternary statement


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@736
PS7, Line 736:     ColumnStatsReader stats_reader(col_chunk, col_type, 
col_order, *node->element);
736-739 could maybe be a helper function


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@747
PS7, Line 747:       bool value_read = false;
747-762 could probably be a static helper function, e.g.

  bool ReadStatFromIndex(const parquet::ColumnIndex&, int page_idx, const 
string& fn_name, bool* is_null_page, void* slot)


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@772
PS7, Line 772:         // For the last page in the RowGroup, the skip ends with 
the number of rows in
This comment is close to just restating the logic below. Could omit it or 
explain the why. E.g. "The end row index for the page must be inferred from 
other metadata."


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@780
PS7, Line 780:         skip_ranges.push_back({begin_row, end_row});
I feel like the logic from 766-780 could be condensed, e.g. using a ternary for 
end_row and maybe just inlining begin_row and eliminating the intermediate 
variable. OK as is but I feel like this function is a bit "noisy" and the core 
algorithm is obscured.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@789
PS7, Line 789:     int col_idx = scalar_reader->col_idx();
This col_idx could also be inlined.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@1154
PS7, Line 1154:       return Status("Top level rows aren't in sync during page 
filtering!");
Include filename in case we need to debug?


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@521
PS7, Line 521:     auto& filtered_row_ranges = parent_->candidate_ranges_;
DCHECK(!candidate_data_pages_.empty()) since this doesn't make sense unless 
index filtering is enabled.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.h@529
PS7, Line 529:     auto current_range = 
parent_->candidate_ranges_[current_row_range_];
DCHECK(!candidate_data_pages_.empty()) since this doesn't make sense unless 
index filtering is enabled.

I feel like it might be clearer if this was 
CurrentCandidateRangeOverlapsCurrentPage() - that's describing more directly 
what it's computing, instead of what action the caller will take.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@510
PS7, Line 510:     if (!candidate_data_pages_.empty() &&
Is there some reason not to do this at the top of the loop? It feels like this 
fits with the NextPage() logic and it might be good to have page advancement 
consolidated in one place.

OK to ignore, but might be worth leaving a comment if there was some rationale 
for this.


http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-column-readers.cc@1601
PS7, Line 1601:   while (current_row_ + 1 > 
filtered_row_ranges[current_row_range_].second) {
Isn't this equivalent to current_row_ >= ?



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 7
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Thu, 21 Mar 2019 00:44:08 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Reply via email to