Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 10: (22 comments) http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@552 PS8, Line 552: /// column is specified by 'column_index'. 'stats_field' specifies whether we should > Left some comment in the call-site: I think pulling that part out of the fu Added GetRequiredStatsField() utility function to make the call more readable. It just didn't feel natural to me to have the 'fn_name' to stats_field logic in ReadStatFromIndex(). http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@644 PS8, Line 644: // {min: 10, max: 20}, and query is 'select * from T where A = 8'. > Independent of the discussion which counters to maintain, I took a while to Ah sorry, I wrote an explanation, but then I replaced my whole comment with "See reply...". :\ It's not obvious at all, in fact, the early patch sets didn't handle that case. Added comment and example here. http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@700 PS9, Line 700: switch (stats_field) { > Use a case-statement, now that it's an enum? Done http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@758 PS9, Line 758: if (col_chunk.column_index_length == 0) continue; : : parquet::ColumnIndex column_index; : RETURN_IF_ERROR(page_index_.DeserializeColumnIndex(col_chunk, &column_index)); : : const ColumnType& col_type = slot_desc->type(); : const vector<parquet::ColumnOrder>& col_orders = file_metadata_.column_orders; : const parquet::ColumnOrder* col_order = col_idx < col_orders.size() ? : &col_orders[col_idx] : nullptr; : C > This looks like a good candidate for another utility function to keep this Done http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc File be/src/exec/parquet/parquet-common-test.cc: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@55 PS9, Line 55: ValidateRanges({{1, 2}, {1, 4}, {1, 8}}, 10, {{0, 0}, {9, 9}}); > I think wrapping this in something called ValidateRangesError() would make Done http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@71 PS9, Line 71: ageLoc > nit: renaming this to "candidate_pages" would help with readibility Done http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@94 PS9, Line 94: ValidatePages({0, 10, 20, 50, 70}, {{0, 9}}, 100, {0}); > Maybe add one > UINT_MAX for good measure? Done http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common.h@49 PS9, Line 49: return std::tie(lhs.first, lhs.last) < std::tie(rhs.first, rhs.last); > I think this is the same as Thanks! Done. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@74 PS8, Line 74: 'cand > I don't feel strongly about it, but it always makes me wonder since we usua Yeah, those "Yoda conditions" never felt natural to me :) OK, I'll leave it as it is if you don't mind. http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc File be/src/exec/parquet/parquet-page-index-test.cc: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc@35 PS9, Line 35: /// Creates a vector of parquet::RowGroup objects based on data in > Methods in this file could benefit from some brief comments on what they do Added comment. http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc@108 PS9, Line 108: ValidatePageIndexRange({{{10, 5, 15, 5}}, {{20, 10, -1, -1}}}, true, 10, 20); > line too long (94 > 90) Done http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h File be/src/exec/parquet/parquet-page-index.h: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h@54 PS9, Line 54: /// Determines the file range that contains the page index for all row groups. > A comment would be good here. One of the outputs could be returned here, e. Added comment. http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h@68 PS9, Line 68: p > nit: lowercase Done http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@385 PS9, Line 385: granularity > spelling Done http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@387 PS9, Line 387: PARQUET_READ_PAGE_INDEX = 79 > I disagree here - having a separate query option can be useful during bench I agree with Csaba. I think it's safer to have the option to turn it off without any side-effect. Maybe later we can deprecate this option. http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@387 PS9, Line 387: PARQUET_READ_PAGE_INDEX = 79 > All other Parquet query options start with PARQUET_..., can we do the same Done http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@323 PS8, Line 323: two columns, one is a > I think it would be good to push that code somewhere, even if it's a person Added link to my personal github repo. http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@353 PS9, Line 353: Contains two columns, one is a decimal column, the other is an array of arrays of > nit: double word Done http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@402 PS9, Line 402: fro > nit: double word Done http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@409 PS9, Line 409: fro > nit: double word Done http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test: http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test@1 PS8, Line 1: # These tests check that the page selection and value value-skipping logic works well > Thanks. Did you have anything particular in mind when writing the test quer I added query-specific comments to 'parquet-page-index.test' and 'nested-types-parquet-page-index.test' where I had different test cases in mind. The other '.test' files contain quite random predicates so I only extended the generic comment a bit. http://gerrit.cloudera.org:8080/#/c/12065/9/tests/query_test/test_parquet_stats.py File tests/query_test/test_parquet_stats.py: http://gerrit.cloudera.org:8080/#/c/12065/9/tests/query_test/test_parquet_stats.py@94 PS9, Line 94: q > flake8: W292 no newline at end of file Done -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 10 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Fri, 12 Apr 2019 10:35:30 +0000 Gerrit-HasComments: Yes