Lars Volker has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )

Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 9:

(21 comments)

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@552
PS8, Line 552:
> Added comments. Instead of 'fn_name', I modified this function to take a 's
Left some comment in the call-site: I think pulling that part out of the 
function made the calling code slightly less readable.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@644
PS8, Line 644:         
COUNTER_ADD(num_stats_filtered_row_groups_by_page_index_counter_, 1);
> See reply comment.
Independent of the discussion which counters to maintain, I took a while to 
understand how this can happen :). Maybe you want to leave a comment that 
explains how the row group statistics will not filter out a column but the 
pages will (e.g. with gaps between the pages)? I don't feel strongly about it, 
if you think it's obvious, feel free to leave it out.


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@700
PS9, Line 700:   if (stats_field == ColumnStatsReader::StatsField::MIN) {
Use a case-statement, now that it's an enum?


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@758
PS9, Line 758:     const string& fn_name = eval->root().function_name();
             :     ColumnStatsReader::StatsField stats_field;
             :     if (fn_name == "lt" || fn_name == "le") {
             :       stats_field = ColumnStatsReader::StatsField::MIN;
             :     } else if (fn_name == "gt" || fn_name == "ge") {
             :       stats_field = ColumnStatsReader::StatsField::MAX;
             :     } else {
             :       DCHECK(false) << "Unsupported function name for statistics 
evaluation: "
             :                     << fn_name;
             :     }
This looks like a good candidate for another utility function to keep this one 
shorter, e.g. GetRequiredStatsField()?


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc
File be/src/exec/parquet/parquet-common-test.cc:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@55
PS9, Line 55:   ValidateRanges({{-1, 1}}, 10, {}, false);
I think wrapping this in something called ValidateRangesError() would make it 
more readable.


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@71
PS9, Line 71: result
nit: renaming this to "candidate_pages" would help with readibility


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@94
PS9, Line 94:   ValidatePages({0}, {{0, 9LL + INT_MAX}}, 100LL + INT_MAX, {0});
Maybe add one > UINT_MAX for good measure?


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common.h@49
PS9, Line 49:   if (lhs.first < rhs.first) return true;
I think this is the same as

  std::tie(lhs.first, lhs.second) < std::tie(rhs.first, rhs.second)


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@74
PS8, Line 74: can p
> I only like to keep them const to avoid having typos in 'if' statements, e.
I don't feel strongly about it, but it always makes me wonder since we usually 
don't denote this in the signature. You could also do if (42 == num_rows) :). 
Either way works for me, feel free to leave as is.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc
File be/src/exec/parquet/parquet-common.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc@95
PS8, Line 95:   DCHECK_LT(page_idx, page_locations.size());
> auto& deduces the const for itself, but I added the const keyword anyway fo
Oh, good to know, thx for the hint.


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc
File be/src/exec/parquet/parquet-page-index-test.cc:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc@35
PS9, Line 35: void ConstructFakeRowGroups(const vector<RowGroupRanges>& 
all_row_group_ranges,
Methods in this file could benefit from some brief comments on what they do


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h
File be/src/exec/parquet/parquet-page-index.h:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h@54
PS9, Line 54:   static void DeterminePageIndexRangeInFile(
A comment would be good here. One of the outputs could be returned here, e.g. 
contains_page_index.


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h@68
PS9, Line 68: P
nit: lowercase


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc
File be/src/exec/parquet/parquet-page-index.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@54
PS8, Line 54:         ci_start = col_chunk.column_index_offset;
> I switched to min(min(a,b),c) and max(max(a,b),c).
Interesting find, thx for trying that out.


http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@385
PS9, Line 385: granurality
spelling


http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@387
PS9, Line 387:   READ_PARQUET_PAGE_INDEX = 79
All other Parquet query options start with PARQUET_..., can we do the same 
here, e.g. PARQUET_READ_PAGE_INDEX?


http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@323
PS8, Line 323: ve Hive 2.1.1 with a m
> I rephrased it. I only modified Parquet-MR, then used the jars of it with H
I think it would be good to push that code somewhere, even if it's a personal 
repo on github. Otherwise it will be close to impossible to re-create these 
files. That's probably true for a bunch of files here though, and doing this 
shouldn't hold up the change.


http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@353
PS9, Line 353: decimals. I used hive Hive 2.1.1 with a modified Parquet-MR, see 
description
nit: double word


http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@402
PS9, Line 402: hive
nit: double word


http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@409
PS9, Line 409: hive
nit: double word


http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test
File 
testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test:

http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test@1
PS8, Line 1: # These tests check that the page selection and value 
value-skipping logic works well
> I added some explanation to the beginning of each files.
Thanks. Did you have anything particular in mind when writing the test queries, 
e.g. making sure that a particular corner case is covered? If so, I think it 
would greatly help the reader to add that as a brief comment to each query, 
e.g. similar to this file: 
https://github.com/apache/impala/blob/master/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test#L259



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 9
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Tue, 09 Apr 2019 21:53:14 +0000
Gerrit-HasComments: Yes

Reply via email to