Csaba Ringhofer has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages ...................................................................... Patch Set 5: (31 comments) Great work with the tests! I am generally ok with the patch, my only complaint is that parquet-column-readers.cc became even more ridiculously complex. I would welcome any idea do make it somewhat simpler even at the cast of small performance regressions in some cases. http://gerrit.cloudera.org:8080/#/c/12065/5//COMMIT_MSG Commit Message: http://gerrit.cloudera.org:8080/#/c/12065/5//COMMIT_MSG@30 PS5, Line 30: Testing I am curious: did you run FE/EE tests on data that was loaded with Parquet index writing turned on? I would expect some planner tests to be broken by the changed file sizes. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/hdfs-scan-node-base.cc File be/src/exec/hdfs-scan-node-base.cc: http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/hdfs-scan-node-base.cc@610 PS5, Line 610: buffer_opts, move(sub_ranges), metadata); nit: indentation http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h File be/src/exec/parquet/hdfs-parquet-scanner.h: http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@324 PS5, Line 324: /// the ScannerContext. It would be very nice to extend this documentation with some high level info on page filtering and row skipping. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@413 PS5, Line 413: /// Contains the leftover ranges after evaluating the Page index. It could be mentioned that empty candidate_ranges_ means that there is no page filtering, so every page is needed. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@462 PS5, Line 462: /// Number of pages that need to be read. : RuntimeProfile::Counter* num_pages_counter_; The comment is a bit confusing: pages dropped by stats filtering do not "have to be read", while num_pages_counter_ seems to contain the total number of pages. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@464 PS5, Line 464: It could be also useful to also track partially filtered pages. This could make tests more exact and help in the future if we want to experiment with writing pages in an aligned way. I am ok with not doing it in this patch. Another idea for an extra stat is to track two kind of filtered pages separately: - those that were dropped directly by a predicate on their column - those that were dropped because they were completely covered with skipped row ranges http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@512 PS5, Line 512: initialize nit: initializing? http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@523 PS5, Line 523: agrees nit: agree http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc File be/src/exec/parquet/hdfs-parquet-scanner.cc: http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@810 PS5, Line 810: num_pages_counter_ This is the only place I found where this counter is incremented, which means that it will remain 0 if no page filtering occurred. I would prefer to always set it. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1816 PS5, Line 1816: candidate_ranges_.empty() expected_rows_in_group could become a member and the effect of page filtering could be applied to it, so this test would be still valid in scans with filtered pages. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1846 PS5, Line 1846: if (candidate_ranges_.empty()) DCHECK_EQ(reader->num_values_read_, num_values_read); : // ReadDataPage() uses metadata_->num_values to determine when the column's done : DCHECK(!candidate_ranges_.empty() || : (reader->num_values_read_ == reader->metadata_->num_values || : !state_->abort_on_error())); Is it possible to "revive" these tests in the page filtered world? http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.h File be/src/exec/parquet/parquet-column-readers.h: http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.h@486 PS5, Line 486: current_top_level_row_ current_top_level_row_ was renamed to current_row_. Note that I prefer the old name. http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@1088 PS4, Line 1088: > Removed DCHECK. Can't the negative sub_range length cause trouble somewhere later? http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc File be/src/exec/parquet/parquet-column-readers.cc: http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@454 PS5, Line 454: while I think that this loop is way too large and is ripe for refactor. My idea is to keep the NextPage() and the candidate_data_pages_ logic here and move the middle of the loop to a function like ReadValueBatchWithinPage() http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@458 PS5, Line 458: if (!NextPage()) { Mentioning that NextPage() skips rows if the page starts within a filtered range would make it clear why the skipping logic is at the end of the loop. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@466 PS5, Line 466: if (!MATERIALIZED && !IN_COLLECTION) { : int vals_to_add = min(num_buffered_values_, max_values - val_count); : val_count += vals_to_add; : num_buffered_values_ -= vals_to_add; : current_row_ += vals_to_add; : DCHECK_GE(num_buffered_values_, 0); : continue; : } This block doesn't seem to respect page skipping. Is this valid because if MATERIALIZED is false (which means count(*) without predicates) then there cannot be any page filtering?This could be mentioned in a comment and checked with a DCHECK. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@478 PS5, Line 478: (IN_COLLECTION && pos_slot_desc_ != nullptr) || !candidate_data_pages_.empty() Do we have to care about rep levels if we are not IN_COLLECTION and there are some filtered pages? http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@544 PS5, Line 544: if ((IN_COLLECTION && pos_slot_desc_ != nullptr) || !candidate_data_pages_.empty()) { Same as at line 478. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@553 PS5, Line 553: if (!candidate_data_pages_.empty()) { : int peek_rep_level = IN_COLLECTION ? rep_levels_.CachePeekNext() : 0; IN_COLLECTION could be moved to the 'if', as the block seems a NOOP if it is false. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@599 PS5, Line 599: if (pos_slot_desc_ != nullptr || !candidate_data_pages_.empty()) { : DCHECK(rep_levels_.CacheHasNext()); : } This seems weird if we are not in a collection. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@629 PS5, Line 629: // Page filtering can also put an upper limit on 'num_def_levels_to_consume'. : if (!candidate_data_pages_.empty()) { : // If we filter pages, the filtered range also puts a limit on : // 'num_def_levels_to_consume'. These two comments seem to say the same thing. http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@1105 PS5, Line 1105: else if (data_start < data_start_based_on_offset_index) { : // 'dictionary_page_offset' is not set, but the offset index and : // column chunk metadata disagree on the data start => column chunk's data start : // is actually the location of the dictionary page. : int64_t dict_start = data_start; : sub_ranges.push_back({dict_start, data_start_based_on_offset_index - dict_start}); Is there a tool that creates such Parquet files? http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README File testdata/data/README: http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@301 PS5, Line 301: decimals_1_10.parquet nit: most file names are followed by ":" (the last 4 are exceptions) http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@302 PS5, Line 302: Contains two decimal columns, one with precision 1, the other with precision 10. : I used Hive 2.1.1 on top of Parquet-MR (6901a20) to create tiny, misaligned pages in : order to test the value-skipping logic in the Parquet column readers. The other copies of this sentence could just point here. The blocks of copy-paste make it harder to notice that the Hive parameters are actually different. http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@305 PS5, Line 305: I slightly modified Parquet-MR (MIN_SLAB_SIZE = 1). It could be useful to upload it to private github and add a link here. http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@309 PS5, Line 309: (1, 1), (2, 2), (3, 3), (4, 4), (5, 5), I didn't spot any tests where the page filtering eliminated all of the pages. Creating a query that survives column stat filtering but eliminates all pages during page filtering would need pages with gaps between max and min values, e.g. the missing 6 in (min=1 max=5) (min=7 max=11). http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@319 PS5, Line 319: (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL); I didn't notice any all-null page in the data. It would be useful to create one to test the skipping of null pages. http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@334 PS5, Line 334: Does the empty lines have a meaning, e.g. do new pages start here for one of the columns? http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test File testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test: http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@3 PS5, Line 3: select * While it is useful to have a few tests that show the whole output, most tests would be ok with select count(*) instead of select *. http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@4 PS5, Line 4: ---- RESULTS Some results could be crosschecked with the same query when READ_PARQUET_PAGE_INDEX=0. http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@95 PS5, Line 95: select * from alltypes_tiny_pages where bigint_col = 0 and month = 7 and year = 2010 It would be nice to have a test that demonstrates that column indexes for different columns "work together", e.g. a: where month = 7 b: where year = 2010 c: where month = 7 and year = 2010 The NumStatsFilteredPages of c. is likely to be higher than a. or b., which would show that both columns are used for filtering. Sorry if there is already such and I just didn't spot it. -- To view, visit http://gerrit.cloudera.org:8080/12065 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: comment Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a Gerrit-Change-Number: 12065 Gerrit-PatchSet: 5 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Lars Volker <[email protected]> Gerrit-Reviewer: Pooja Nilangekar <[email protected]> Gerrit-Reviewer: Tim Armstrong <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]> Gerrit-Comment-Date: Thu, 28 Feb 2019 22:12:17 +0000 Gerrit-HasComments: Yes
