Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 6:

(30 comments)

http://gerrit.cloudera.org:8080/#/c/12065/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12065/5//COMMIT_MSG@30
PS5, Line 30: Testing
> I am curious: did you run FE/EE tests on data that was loaded with Parquet
I needed to update the stats extrapolation tests.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/hdfs-scan-node-base.cc@610
PS5, Line 610: buffer_opts, move(sub_ranges), metadata);
> nit: indentation
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@324
PS5, Line 324: /// the ScannerContext.
> It would be very nice to extend this documentation with some high level inf
Added some comments.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@413
PS5, Line 413: ParquetFileVersion file_version_;
> It could be mentioned that empty candidate_ranges_ means that there is no p
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@462
PS5, Line 462:
             : /// Number of columns that need to be read.
> The comment is a bit confusing: pages dropped by stats filtering do not "ha
I used comments similar to the ones we use for the row group counters. Anyway, I modified and extended the comment.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@464
PS5, Line 464: RuntimeProfile::Counter* num_cols_counter_;
> It could also be useful to track partially filtered pages. This could
These counters could be useful for debugging, but in my opinion they would be confusing and too low-level in the profile.
I think the predicates in the query string and the existing counters above give us enough information to validate page selection.

I added a new counter for row groups that are filtered by the page index.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@512
PS5, Line 512:
> nit: initializing?
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@523
PS5, Line 523: Recov
> nit: agree
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@810
PS5, Line 810: oup& row_group = f
> This is the only place I found where this counter is incremented, which mea
It is also incremented in parquet-column-readers.cc::BaseScalarColumnReader::ReadDataPage().


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1816
PS5, Line 1816:
> expected_rows_in_group could become a member and the effect of page filteri
We can calculate the expected row count here as well. Added a new nested if statement.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1846
PS5, Line 1846: }
              :   }
              : }
              :
              : // Validate scalar column readers'
> Is it possible to "revive" these tests in the page filtered world?
Currently 'reader->num_values_read_' is the sum of 'num_values' from the Parquet pages. We could subtract from it whenever we skip values, but that would make the column readers a bit more complex. I have added some extra validation in CheckPageFiltering() for the case when we use the page index.
http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.h@486
PS5, Line 486: values' or up to value
> current_top_level_row_ was renamed to current_row_. Note that I prefer the
I didn't want to spell out "top_level" every time, and it was confusing to have both top-level rows and rows in the code. Now "row" means the top-level rows and "value" means the "primitive" values.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@454
PS5, Line 454: while
> I think that this loop is way too large and is ripe for refactor. My idea i
Will think about it. Currently I'm not sure that substituting those lines with a function call that takes a lot of parameters would improve readability significantly.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@458
PS5, Line 458: if (num_buffered_values_ == 0) {
> Mentioning that NextPage() skips rows if the page starts within a filtered
Extended the comment above.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@466
PS5, Line 466: // count from page metadata to return the correct number of rows.
             : if (!MATERIALIZED && !IN_COLLECTION) {
             :   // We cannot filter pages in this context.
             :   DCHECK(candidate_data_pages_.empty());
             :   int vals_to_add = min(num_buffered_values_, max_values - val_count);
             :   val_count += vals_to_add;
             :   num_buffered_values_ -= vals_to_add;
             :
> This block doesn't seem to respect page skipping.
> Is this valid because if
Added comment and DCHECK.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@478
PS5, Line 478: re about the nesting structure unless the position slot is being populated,
> Do we have to care about rep levels if we are not IN_COLLECTION and there a
Updated the condition.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@544
PS5, Line 544: DCHECK_GT(num_buffered_values_, 0);
> Same as at line 478.
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@553
PS5, Line 553: max_values = min(max_values, num_buffered_values_);
             : while (def_levels_.CacheHasNext() && val_count < max_values) {
> IN_COLLECTION could be moved to the 'if', as the block seems a NOOP if it i
It's not a no-op in that case: 'peek_rep_level' will be 0, which is part of the condition in the following line.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@599
PS5, Line 599: }
             :
             : //
> This seems weird if we are not in a collection.
We don't have the IN_COLLECTION template parameter here, but I added 'max_rep_level_ > 0'.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@629
PS5, Line 629: }
             : } else {
             :   // Cannot consume more levels than allowed by buffered input values and output space.
             :   num_def_levels_to_consume = min
> These two comments seem to say the same thing.
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@1105
PS5, Line 1105: }
              : else if (data_start < data_start_based_on_offset_index) {
              :   // 'dictionary_page_offset' is not set, but the offset index and
              :   // column chunk metadata disagree on the data start => column chunk's data start
              :   // is actually the location of the dictionary page.
              :   int64_t dict_start = data_start;
> Is there a tool that creates such Parquet files?
Parquet-MR behaves like that.
It doesn't populate 'dictionary_page_offset', but sets 'data_page_offset' to the offset of the dictionary page.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@301
PS5, Line 301: decimals_1_10.parquet:
> nit: most file names are followed by ":" (the last 4 are exceptions)
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@302
PS5, Line 302: Contains two decimal columns, one with precision 1, the other with precision 10.
             : I used Hive 2.1.1 on top of Parquet-MR (6901a20) to create tiny, misaligned pages in
             : order to test the value-skipping logic in the Parquet column readers.
> The other copies of this sentence could just point here. The blocks of copy
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@305
PS5, Line 305: I slightly modified Parquet-MR (MIN_SLAB_SIZE = 1).
> It could be useful to upload it to private github and add a link here.
I used snapshot versions of Hive and Parquet-MR, and you'd need to use matching versions of those, otherwise you'll likely get some exceptions. I think the description of the modification is enough for people to reproduce it.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@309
PS5, Line 309: (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
> I didn't spot any tests where the page filtering eliminated all of the page
Actually, I didn't handle that case; I modified the code a bit and added a test.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@319
PS5, Line 319: (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
> I didn't notice any all-null page in the data. It would be useful to create
Actually, the NULL in this line belongs to a null page because it's in the second column (d_10).


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@334
PS5, Line 334: select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all
> Do the empty lines have a meaning, e.g.
> do new pages start here for one
No, they just separate pages that have different structures.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test:

http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@3
PS5, Line 3: select *
> While it is useful to have a few tests that show the whole output, most tes
I'd rather keep the data to check that the decoders and the skipping logic don't drift somewhere.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@4
PS5, Line 4: ---- RESULTS
> Some results could be crosschecked with the same query when READ_PARQUET_PA
At first I generated the results with READ_PARQUET_PAGE_INDEX=false, then validated them with READ_PARQUET_PAGE_INDEX=true. After that I added the "RUNTIME_PROFILE" sections and regenerated the results with READ_PARQUET_PAGE_INDEX=true.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@95
PS5, Line 95: select * from alltypes_tiny_pages where bigint_col = 0 and month = 7 and year = 2010
> It would be nice to have a test that demonstrates that column indexes for d
There are tests with multiple predicates on different columns. I chose those predicates so that together they filter out more rows.
--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Thu, 14 Mar 2019 17:01:04 +0000
Gerrit-HasComments: Yes