Zoltan Borok-Nagy has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )

Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 10:

(22 comments)

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@552
PS8, Line 552:   /// column is specified by 'column_index'. 'stats_field' 
specifies whether we should
> Left some comment in the call-site: I think pulling that part out of the fu
Added GetRequiredStatsField() utility function to make the call more readable.

It just didn't feel natural to me to have the 'fn_name' to stats_field logic in 
ReadStatFromIndex().


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@644
PS8, Line 644:         // {min: 10, max: 20}, and query is 'select * from T 
where A = 8'.
> Independent of the discussion which counters to maintain, I took a while to
Ah sorry, I wrote an explanation, but then I replaced my whole comment with 
"See reply...". :\

It's not obvious at all, in fact, the early patch sets didn't handle that case. 
Added comment and example here.


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@700
PS9, Line 700:   switch (stats_field) {
> Use a case-statement, now that it's an enum?
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@758
PS9, Line 758:     if (col_chunk.column_index_length == 0) continue;
             :
             :     parquet::ColumnIndex column_index;
             :     
RETURN_IF_ERROR(page_index_.DeserializeColumnIndex(col_chunk, &column_index));
             :
             :     const ColumnType& col_type = slot_desc->type();
             :     const vector<parquet::ColumnOrder>& col_orders = 
file_metadata_.column_orders;
             :     const parquet::ColumnOrder* col_order = col_idx < 
col_orders.size() ?
             :         &col_orders[col_idx] : nullptr;
             :     C
> This looks like a good candidate for another utility function to keep this
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc
File be/src/exec/parquet/parquet-common-test.cc:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@55
PS9, Line 55:   ValidateRanges({{1, 2}, {1, 4}, {1, 8}}, 10, {{0, 0}, {9, 9}});
> I think wrapping this in something called ValidateRangesError() would make
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@71
PS9, Line 71: ageLoc
> nit: renaming this to "candidate_pages" would help with readibility
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@94
PS9, Line 94:   ValidatePages({0, 10, 20, 50, 70}, {{0, 9}}, 100, {0});
> Maybe add one > UINT_MAX for good measure?
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common.h@49
PS9, Line 49:   return std::tie(lhs.first, lhs.last) < std::tie(rhs.first, 
rhs.last);
> I think this is the same as
Thanks! Done.


http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h
File be/src/exec/parquet/parquet-common.h:

http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@74
PS8, Line 74: 'cand
> I don't feel strongly about it, but it always makes me wonder since we usua
Yeah, those "Yoda conditions" never felt natural to me :)

OK, I'll leave it as it is if you don't mind.


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc
File be/src/exec/parquet/parquet-page-index-test.cc:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc@35
PS9, Line 35: /// Creates a vector of parquet::RowGroup objects based on data in
> Methods in this file could benefit from some brief comments on what they do
Added comment.


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc@108
PS9, Line 108:   ValidatePageIndexRange({{{10, 5, 15, 5}}, {{20, 10, -1, -1}}}, 
true, 10, 20);
> line too long (94 > 90)
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h
File be/src/exec/parquet/parquet-page-index.h:

http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h@54
PS9, Line 54:   /// Determines the file range that contains the page index for 
all row groups.
> A comment would be good here. One of the outputs could be returned here, e.
Added comment.


http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h@68
PS9, Line 68: p
> nit: lowercase
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift
File common/thrift/ImpalaService.thrift:

http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@385
PS9, Line 385: granularity
> spelling
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@387
PS9, Line 387:   PARQUET_READ_PAGE_INDEX = 79
> I disagree here - having a separate query option can be useful during bench
I agree with Csaba. I think it's safer to have the option to turn it off 
without any side-effect. Maybe later we can deprecate this option.


http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@387
PS9, Line 387:   PARQUET_READ_PAGE_INDEX = 79
> All other Parquet query options start with PARQUET_..., can we do the same
Done


http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@323
PS8, Line 323: two columns, one is a
> I think it would be good to push that code somewhere, even if it's a person
Added link to my personal github repo.


http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@353
PS9, Line 353: Contains two columns, one is a decimal column, the other is an 
array of arrays of
> nit: double word
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@402
PS9, Line 402:  fro
> nit: double word
Done


http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@409
PS9, Line 409:  fro
> nit: double word
Done


http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test
File 
testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test:

http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test@1
PS8, Line 1: # These tests check that the page selection and value 
value-skipping logic works well
> Thanks. Did you have anything particular in mind when writing the test quer
I added query-specific comments to 'parquet-page-index.test' and 
'nested-types-parquet-page-index.test' where I had different test cases in mind.

The other '.test' files contain quite random predicates so I only extended the 
generic comment a bit.


http://gerrit.cloudera.org:8080/#/c/12065/9/tests/query_test/test_parquet_stats.py
File tests/query_test/test_parquet_stats.py:

http://gerrit.cloudera.org:8080/#/c/12065/9/tests/query_test/test_parquet_stats.py@94
PS9, Line 94: q
> flake8: W292 no newline at end of file
Done



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 10
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Comment-Date: Fri, 12 Apr 2019 10:35:30 +0000
Gerrit-HasComments: Yes

Reply via email to