Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 7: (54 comments) In a few comments the row group level profile counters came up. Whether we should only have "NumStatsFilteredRowGroups" and count row groups that are filtered out by either row group-level stats or page-level stats. Or, we should have multiple variants of that counter, counting the row group-level and page-level filtering separately. What do you think? For debugging it's good to have more fine-grained data, but I'm not sure that the query profile is the right place for such information. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/hdfs-scan-node-base.h File be/src/exec/hdfs-scan-node-base.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/hdfs-scan-node-base.h@272 PS8, Line 272: io::ScanRange* AllocateScanRange(hdfsFS fs, const char* file, > nit: same line wrapping for both methods Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@360 PS8, Line 360: name in LLVM IR. > nit: make this a const& or non-const ptr. If a const* is required to be nul The signature matches the constructor of ColumnStatsReader. I added a short comment about that 'col_order' might be null. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@428 PS8, Line 428: > nit: lower case Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@429 PS8, Line 429: at are eli > nit: missing comma Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.h@552 PS8, Line 552: /// Returns OK if the query is not cancelled and hasn't exceeded any mem limits. > This looks like a candidate to explain the various parameters in some detai Added comments. Instead of 'fn_name', I modified this function to take a 'stats_field' parameter. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@119 PS7, Line 119: num_stats_filtered_row_groups_by_page_index_counter_ = > Should these be NumStatsFilteredRowGroups and NumStatsFilteredPages? See reply comment http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@119 PS7, Line 119: num_stats_filtered_row_groups_by_page_index_counter_ = > I am not sure how to track this, but it is also possible that some columns I'm checking if there's an offset index for the first column chunk. Offset index should be present for every column, including timestamp and double columns as well. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@803 PS7, Line 803: CalculateCandidateRanges > That's a good point. One could pass skip_ranges by value and then let the c I'm passing a pointer so I can certainly avoid copying. http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/hdfs-parquet-scanner.cc@813 PS7, Line 813: d > You could pass the scalar reader into ComputeCandidatePages() to call the s Yeah I definetely don't want to introduce a dependency in parquet-common to ScalarReaders. And using a lambda feels a bit of an overkill to me for such a simple problem. But if you feel strongly about it I can settle with the lambda. What concerns me most about setting this member of ScalarReader here is that later we are calling InitScalarColumns() and even Reset() on the scalar readers. I just couldn't yet figure out a better control flow without using unnecessary boilerplate code and extra state. But maybe I'll just store the candidate pages in a hash map, then pass them to the scalar readers InitScalarColumns(). I could use a setter there, or extend the Reset() member function. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@117 PS8, Line 117: NumStatsFilteredRowGroups > This name is not precise anymore, as it only means row groups filtered by c See reply comment. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@120 PS8, Line 120: NumStatsFilteredRowGroupsByPageIndex > Maybe "completely" could be added to the name to make its meaning more exac See reply comment. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@517 PS8, Line 517: const parquet::ColumnChunk& col_chunk = row_group.columns[col_idx]; : const ColumnType& col_type = slot_d > nit: I prefer breaking after = I looked at the assignments of this file and parquet-column-readers.cc and we usually don't break lines right after =. I'd prefer to keep it this way to keep consistency and to potentially save lines. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@525 PS8, Line 525: } > nit: I prefer breaking after = See other comment http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@526 PS8, Line 526: > nit: double space Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@644 PS8, Line 644: InitCollectionColumns(); > Do we expect this to happen in reality if the row group also has stats? If See reply comment. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@752 PS8, Line 752: _ > nit: I prefer breaking after = See other comment http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@755 PS8, Line 755: column_index.max_valu > nit: I prefer breaking after = See other comment http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/hdfs-parquet-scanner.cc@773 PS8, Line 773: // the RowGroup. For all other pages, it end at first_row_index of the next > nit: I would prefer breaking before GetRowRangeForPage GetRowRangeForPage() now uses an output parameter, so there is no line break anymore. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.h@503 PS8, Line 503: le (rep_levels_.Cach > I think that adding "InRowGroup" to the name would make the call site clear I prefer this name because it is shorter, and this class has a lot of members that only have meaning in the current row group so I think it shouldn't distract the readers. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.h@526 PS8, Line 526: > The name mislead me a bit, I have assumed that it returns an integer. Maybe Thanks for suggestion! It's better name indeed. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.cc@390 PS8, Line 390: template <typename InternalType, parquet::Type::type PARQUET_TYPE, bool MATERIALIZED> > line too long (91 > 90) Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-column-readers.cc@1732 PS8, Line 1732: parent, node, slot_desc); > This seems to be the only place where num_values_read_ is decreased, so it Yeah, I agree. Added comment to 'num_values_read'. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc File be/src/exec/parquet/parquet-common-test.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@32 PS8, Line 32: TEST(ParquetCommon, CalculateCandidateRanges) { > Please add some comments to the TEST functions that briefly explain what th Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@32 PS8, Line 32: CalculateCandidateRanges > rename Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@49 PS8, Line 49: CalculateCandidateRanges({{1, 8}, {3, 8}, {7, 8}}, 10), {{0, 0}, {9, 9}}); > Let's add some tests about error conditions, either here or elsewhere (this Modified ComputeCandidateRanges() to return a bool to indicate success. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@52 PS8, Line 52: int > It would be good to have some tests that validate that values > INT_MAX don Switched to int64_t and added test case with values greater than INT_MAX. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@61 PS8, Line 61: CalculateCandidatePages > rename Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common-test.cc@77 PS8, Line 77: ValidatePages({0, 10, 20, 50, 70}, {{9, 9}, {10, 10}, {50, 50}}, 100, {0, 1, 3}); > Some tests for error conditions would be good here, too. Added some test cases for failure. Modified ComputeCandidatePages() to return a bool to indicate success. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h File be/src/exec/parquet/parquet-common.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@72 PS8, Line 72: e complex > "want to read" or "want to scan"? Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@74 PS8, Line 74: calls > nit: I think we usually omit this when passing by value I only like to keep them const to avoid having typos in 'if' statements, e.g.: if (num_rows = 42) It'd change the value of 'num_rows' and the conditions would evaluate to true. But I can change it to non-const for consistency if you still want. Proper testing should also protect us against such errors. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.h@329 PS8, Line 329: return byte_size; : } > I am not sure if this matters but str_len of -4 is accepted as OK. Done http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/12065/7/be/src/exec/parquet/parquet-common.cc@123 PS7, Line 123: vector<int> filtered_pages; > I think "candidate_pages" is ok as long as it's only ever used as a positiv Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc File be/src/exec/parquet/parquet-common.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc@92 PS8, Line 92: vector<pair<int64_t, int64_t>> CalculateCandidateRanges( > nit: Return by pointer. I'm open to have a discussion whether we want to ma I switched to return by pointer. I think since we don't use exceptions we can only report status via the return value or an output parameter, so I think there's little hope to switch to "return by value" by default. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc@95 PS8, Line 95: int skip_end = -1; > nit: const auto&? auto& deduces the const for itself, but I added the const keyword anyway for better readability. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-common.cc@106 PS8, Line 106: if (skip_end < num_rows - 1) { > Should we mark this one as static? We can't really do that because we call it from hdfs-parquet-scanner.cc. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-level-decoder.h File be/src/exec/parquet/parquet-level-decoder.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-level-decoder.h@109 PS8, Line 109: // Prepare cache for reading if it's empty. > Mention what this does and what the return value means? Extended the comment. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h File be/src/exec/parquet/parquet-page-index.h: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@32 PS8, Line 32: /// Helper class for reading the Parquet page index. > Mention briefly what this class does, i.e. the buffer management required t Extended class comment http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@37 PS8, Line 37: /// It needs to be called before the serialization methods. > nit: Maybe make this read a bit more fluently, and mention what it does bef Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@61 PS8, Line 61: ScopedBuffer raw_page_index_; > rename to page_index_buffer_ or just buffer_? Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@66 PS8, Line 66: > missing word? Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.h@68 PS8, Line 68: int64_t max_offset_; > We could replace this with the buffer size to make the relation between the Extended ScopedBuffer with size() member function and use it. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc File be/src/exec/parquet/parquet-page-index.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@43 PS8, Line 43: base_offset_ = numeric_limits<int64_t>::max(); : max_offset_ = -1; > remove? They are the initial values for min/max searching. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@47 PS8, Line 47: int64_t ci_start = col_chunk.column_index_offset; > We should have a test that makes sure that this logic behaves well for brok I moved the logic that calculates the file range to a static function and added unit tests for it in the newly created 'parquet-page-index-test.cc'. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@54 PS8, Line 54: if (idx_max > max_offset_) max_offset_ = idx_max; > This is just max3(ci_end, oi_end, max_offset_), right? I just found out tha I switched to min(min(a,b),c) and max(max(a,b),c). I microbenchmarked min(min(a,b),c) vs min({a,b,c}) and the latter is 4x slower. It probably won't make a big difference, but I'd rather avoid using initializer lists for that. My main concern is that it might allocate its elements on the heap, but probably the compiler is more clever than that. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@57 PS8, Line 57: if (base_offset_ >= max_offset_) return Status::OK(); > Should we have a comment to explain why we don't return an error here, give Added comment about it. http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/exec/parquet/parquet-page-index.cc@108 PS8, Line 108: DCHECK_GE(file_offset, base_offset_); > This might be slightly more readable by changing it do DCHECK_GE(buffer_off Done http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/service/query-options.cc File be/src/service/query-options.cc: http://gerrit.cloudera.org:8080/#/c/12065/8/be/src/service/query-options.cc@730 PS8, Line 730: iequals(value, "true") || iequals(value, "1")); > Use IsTrue(), might also fix elsewhere since the other occurrence doesn't l Done http://gerrit.cloudera.org:8080/#/c/12065/8/common/thrift/ImpalaInternalService.thrift File common/thrift/ImpalaInternalService.thrift: http://gerrit.cloudera.org:8080/#/c/12065/8/common/thrift/ImpalaInternalService.thrift@331 PS8, Line 331: Page > Refer to the comment in ImpalaService.thrift like the other options? Done http://gerrit.cloudera.org:8080/#/c/12065/8/common/thrift/ImpalaService.thrift File common/thrift/ImpalaService.thrift: http://gerrit.cloudera.org:8080/#/c/12065/8/common/thrift/ImpalaService.thrift@384 PS8, Line 384: Page > We should mention what this does, similar to PARQUET_READ_STATISTICS. Extended commit message. nit: done http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@323 PS8, Line 323: hacked version of hive > The previous file doesn't mention the Hive hack I rephrased it. I only modified Parquet-MR, then used the jars of it with Hive 2.1.1. http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@353 PS8, Line 353: hacked version of hive > same Done http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/data/README@402 PS8, Line 402: hacked version of hive > same here Done http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test File testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test: http://gerrit.cloudera.org:8080/#/c/12065/8/testdata/workloads/functional-query/queries/QueryTest/nested-types-parquet-page-index.test@1 PS8, Line 1: ==== > I think these tests would be easier to follow if they had a brief comment e I added some explanation to the beginning of each files. http://gerrit.cloudera.org:8080/#/c/12065/8/tests/query_test/test_parquet_stats.py File tests/query_test/test_parquet_stats.py: http://gerrit.cloudera.org:8080/#/c/12065/8/tests/query_test/test_parquet_stats.py@77 PS8, Line 77: def test_page_index(self, vector, unique_database): > Please add a comment to this function. Done -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 7 Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com> Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com> Gerrit-Reviewer: Lars Volker <l...@cloudera.com> Gerrit-Reviewer: Michael Ho <k...@cloudera.com> Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com> Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com> Gerrit-Reviewer: Zoltan Borok-Nagy <borokna...@cloudera.com> Gerrit-Comment-Date: Mon, 08 Apr 2019 20:10:47 +0000 Gerrit-HasComments: Yes