Lars Volker has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 9: (21 comments) http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@552 PS8, Line 552: > Added comments. Instead of 'fn_name', I modified this function to take a 's Left some comment in the call-site: I think pulling that part out of the function made the calling code slightly less readable. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@644 PS8, Line 644: COUNTER_ADD(num_stats_filtered_row_groups_by_page_index_counter_, 1); > See reply comment. Independent of the discussion which counters to maintain, I took a while to understand how this can happen :). Maybe you want to leave a comment that explains how the row group statistics will not filter out a column but the pages will (e.g. with gaps between the pages)? I don't feel strongly about it, if you think it's obvious, feel free to leave it out. http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@700 PS9, Line 700: if (stats_field == ColumnStatsReader::StatsField::MIN) { Use a case-statement, now that it's an enum? http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/hdfs-parquet-scanner.cc@758 PS9, Line 758: const string& fn_name = eval->root().function_name(); : ColumnStatsReader::StatsField stats_field; : if (fn_name == "lt" || fn_name == "le") { : stats_field = ColumnStatsReader::StatsField::MIN; : } else if (fn_name == "gt" || fn_name == "ge") { : stats_field = ColumnStatsReader::StatsField::MAX; : } else { : DCHECK(false) << "Unsupported function name for statistics evaluation: " : << fn_name; : } This looks like a good candidate for another utility function to keep this one shorter, e.g. GetRequiredStatsField()? http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc File be/src/exec/parquet/parquet-common-test.cc: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@55 PS9, Line 55: ValidateRanges({{-1, 1}}, 10, {}, false); I think wrapping this in something called ValidateRangesError() would make it more readable. http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@71 PS9, Line 71: result nit: renaming this to "candidate_pages" would help with readibility http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common-test.cc@94 PS9, Line 94: ValidatePages({0}, {{0, 9LL + INT_MAX}}, 100LL + INT_MAX, {0}); Maybe add one > UINT_MAX for good measure? http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-common.h@49 PS9, Line 49: if (lhs.first < rhs.first) return true; I think this is the same as std::tie(lhs.first, lhs.second) < std::tie(rhs.first, rhs.second) http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@74 PS8, Line 74: can p > I only like to keep them const to avoid having typos in 'if' statements, e. I don't feel strongly about it, but it always makes me wonder since we usually don't denote this in the signature. You could also do if (42 == num_rows) :). Either way works for me, feel free to leave as is. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc@95 PS8, Line 95: DCHECK_LT(page_idx, page_locations.size()); > auto& deduces the const for itself, but I added the const keyword anyway fo Oh, good to know, thx for the hint. http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc File be/src/exec/parquet/parquet-page-index-test.cc: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index-test.cc@35 PS9, Line 35: void ConstructFakeRowGroups(const vector<RowGroupRanges>& all_row_group_ranges, Methods in this file could benefit from some brief comments on what they do http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h File be/src/exec/parquet/parquet-page-index.h: http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h@54 PS9, Line 54: static void DeterminePageIndexRangeInFile( A comment would be good here. One of the outputs could be returned here, e.g. contains_page_index. http://gerrit.cloudera.org:8080/#/c/12065/9/be/src/exec/parquet/parquet-page-index.h@68 PS9, Line 68: P nit: lowercase http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc File be/src/exec/parquet/parquet-page-index.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@54 PS8, Line 54: ci_start = col_chunk.column_index_offset; > I switched to min(min(a,b),c) and max(max(a,b),c). Interesting find, thx for trying that out. http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@385 PS9, Line 385: granurality spelling http://gerrit.cloudera.org:8080/#/c/12065/9/common/thrift/ImpalaService.thrift@387 PS9, Line 387: READ_PARQUET_PAGE_INDEX = 79 All other Parquet query options start with PARQUET_..., can we do the same here, e.g. PARQUET_READ_PAGE_INDEX? http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@323 PS8, Line 323: ve Hive 2.1.1 with a m > I rephrased it. I only modified Parquet-MR, then used the jars of it with H I think it would be good to push that code somewhere, even if it's a personal repo on github. Otherwise it will be close to impossible to re-create these files. That's probably true for a bunch of files here though, and doing this shouldn't hold up the change. http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@353 PS9, Line 353: decimals. I used hive Hive 2.1.1 with a modified Parquet-MR, see description nit: double word http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@402 PS9, Line 402: hive nit: double word http://gerrit.cloudera.org:8080/#/c/12065/9/testdata/data/README@409 PS9, Line 409: hive nit: double word http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test: http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test@1 PS8, Line 1: # These tests check that the page selection and value value-skipping logic works well > I added some explanation to the beginning of each files. Thanks. Did you have anything particular in mind when writing the test queries, e.g. making sure that a particular corner case is covered? If so, I think it would greatly help the reader to add that as a brief comment to each query, e.g. similar to this file: https://github.com/apache/impala/blob/master/testdata/workloads/functional-query/queries/QueryTest/parquet-filtering.test#L259 -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 9 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Tue, 09 Apr 2019 21:53:14 +0000 Gerrit-HasComments: Yes