[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Csaba Ringhofer (Code Review) Thu, 28 Feb 2019 14:12:35 -0800

Csaba Ringhofer has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/12065 )


Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 5:

(31 comments)

Great work with the tests!

I am generally ok with the patch, my only complaint is that 
parquet-column-readers.cc became even more ridiculously complex. I would 
welcome any idea do make it somewhat simpler even at the cast of small 
performance regressions in some cases.

http://gerrit.cloudera.org:8080/#/c/12065/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12065/5//COMMIT_MSG@30
PS5, Line 30: Testing
I am curious: did you run FE/EE tests on data that was loaded with Parquet 
index writing turned on? I would expect some planner tests to be broken by the 
changed file sizes.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/hdfs-scan-node-base.cc@610
PS5, Line 610:   buffer_opts, move(sub_ranges), metadata);
nit: indentation


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@324
PS5, Line 324: /// the ScannerContext.
It would be very nice to extend this documentation with some high level info on 
page filtering and row skipping.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@413
PS5, Line 413: /// Contains the leftover ranges after evaluating the Page index.
It could be mentioned that empty candidate_ranges_ means that there is no page 
filtering, so every page is needed.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@462
PS5, Line 462:   /// Number of pages that need to be read.
             :   RuntimeProfile::Counter* num_pages_counter_;
The comment is a bit confusing: pages dropped by stats filtering do not "have 
to be read", while num_pages_counter_ seems to contain the total number of 
pages.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@464
PS5, Line 464:
It could be also useful to also track partially filtered pages. This could make 
tests more exact and help in the future if we want to experiment with writing 
pages in an aligned way. I am ok with not doing it in this patch.

Another idea for an extra stat is to track two kind of filtered pages 
separately:
- those that were dropped directly by a predicate on their column
- those that were dropped because they were completely covered with skipped row 
ranges


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@512
PS5, Line 512: initialize
nit: initializing?


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@523
PS5, Line 523: agrees
nit: agree


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@810
PS5, Line 810: num_pages_counter_
This is the only place I found where this counter is incremented, which means 
that it will remain 0 if no page filtering occurred. I would prefer to always 
set it.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1816
PS5, Line 1816: candidate_ranges_.empty()
expected_rows_in_group could become a member and the effect of page filtering 
could be applied to it, so this test would be still valid in scans with 
filtered pages.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1846
PS5, Line 1846:     if (candidate_ranges_.empty()) 
DCHECK_EQ(reader->num_values_read_, num_values_read);
              :     // ReadDataPage() uses metadata_->num_values to determine 
when the column's done
              :     DCHECK(!candidate_ranges_.empty() ||
              :         (reader->num_values_read_ == 
reader->metadata_->num_values ||
              :         !state_->abort_on_error()));
Is it possible to "revive" these tests in the page filtered world?


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.h@486
PS5, Line 486: current_top_level_row_
current_top_level_row_ was renamed to current_row_. Note that I prefer the old 
name.


http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/4/be/src/exec/parquet/parquet-column-readers.cc@1088
PS4, Line 1088:
> Removed DCHECK.
Can't the negative sub_range length cause trouble somewhere later?


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@454
PS5, Line 454: while
I think that this loop is way too large and is ripe for refactor. My idea is to 
keep the NextPage() and the candidate_data_pages_ logic here and move the 
middle of the loop to a function like ReadValueBatchWithinPage()


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@458
PS5, Line 458:       if (!NextPage()) {
Mentioning that NextPage() skips rows if the page starts within a filtered 
range would make it clear why the skipping logic is at the end of the loop.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@466
PS5, Line 466:     if (!MATERIALIZED && !IN_COLLECTION) {
             :       int vals_to_add = min(num_buffered_values_, max_values - 
val_count);
             :       val_count += vals_to_add;
             :       num_buffered_values_ -= vals_to_add;
             :       current_row_ += vals_to_add;
             :       DCHECK_GE(num_buffered_values_, 0);
             :       continue;
             :     }
This block doesn't seem to respect page skipping. Is this valid because if 
MATERIALIZED is false (which means count(*) without predicates) then there 
cannot be any page filtering?This could be mentioned in a comment and checked 
with a DCHECK.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@478
PS5, Line 478: (IN_COLLECTION && pos_slot_desc_ != nullptr) || 
!candidate_data_pages_.empty()
Do we have to care about rep levels if we are not IN_COLLECTION and there are 
some filtered pages?


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@544
PS5, Line 544:   if ((IN_COLLECTION && pos_slot_desc_ != nullptr) || 
!candidate_data_pages_.empty()) {
Same as at line 478.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@553
PS5, Line 553:     if (!candidate_data_pages_.empty()) {
             :       int peek_rep_level = IN_COLLECTION ? 
rep_levels_.CachePeekNext() : 0;
IN_COLLECTION could be moved to the 'if', as the block seems a NOOP if it is 
false.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@599
PS5, Line 599:   if (pos_slot_desc_ != nullptr || 
!candidate_data_pages_.empty()) {
             :     DCHECK(rep_levels_.CacheHasNext());
             :   }
This seems weird if we are not in a collection.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@629
PS5, Line 629:   // Page filtering can also put an upper limit on 
'num_def_levels_to_consume'.
             :   if (!candidate_data_pages_.empty()) {
             :     // If we filter pages, the filtered range also puts a limit 
on
             :     // 'num_def_levels_to_consume'.
These two comments seem to say the same thing.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@1105
PS5, Line 1105:     else if (data_start < data_start_based_on_offset_index) {
              :       // 'dictionary_page_offset' is not set, but the offset 
index and
              :       // column chunk metadata disagree on the data start => 
column chunk's data start
              :       // is actually the location of the dictionary page.
              :       int64_t dict_start = data_start;
              :       sub_ranges.push_back({dict_start, 
data_start_based_on_offset_index - dict_start});
Is there a tool that creates such Parquet files?


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@301
PS5, Line 301: decimals_1_10.parquet
nit: most file names are followed by ":" (the last 4 are exceptions)


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@302
PS5, Line 302: Contains two decimal columns, one with precision 1, the other 
with precision 10.
             : I used Hive 2.1.1 on top of Parquet-MR (6901a20) to create tiny, 
misaligned pages in
             : order to test the value-skipping logic in the Parquet column 
readers.
The other copies of this sentence could just point here. The blocks of 
copy-paste make it harder to notice that the Hive parameters are actually 
different.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@305
PS5, Line 305: I slightly modified Parquet-MR (MIN_SLAB_SIZE = 1).
It could be useful to upload it to private github and add a link here.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@309
PS5, Line 309: (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
I didn't spot any tests where the page filtering eliminated all of the pages. 
Creating a query that survives column stat filtering but eliminates all pages 
during page filtering would need pages with gaps between max and min values, 
e.g. the missing 6 in (min=1 max=5) (min=7 max=11).


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@319
PS5, Line 319:  (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
I didn't notice any all-null page in the data. It would be useful to create one 
to test the skipping of null pages.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@334
PS5, Line 334:
Does the empty lines have  a meaning, e.g. do new pages start here for one of 
the columns?


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test
File 
testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test:

http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@3
PS5, Line 3: select *
While it is useful to have a few tests that show the whole output, most tests 
would be ok with select count(*) instead of select *.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@4
PS5, Line 4: ---- RESULTS
Some results could be crosschecked with the same query when 
READ_PARQUET_PAGE_INDEX=0.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@95
PS5, Line 95: select * from alltypes_tiny_pages where bigint_col = 0 and month 
= 7 and year = 2010
It would be nice to have a test that demonstrates that column indexes for 
different columns "work together", e.g.
a: where month = 7
b: where year = 2010
c: where month = 7 and year = 2010
The NumStatsFilteredPages of c. is likely to be higher than a. or b., which 
would show that both columns are used for filtering. Sorry if there is already 
such and I just didn't spot it.



--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 5
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Lars Volker <[email protected]>
Gerrit-Reviewer: Pooja Nilangekar <[email protected]>
Gerrit-Reviewer: Tim Armstrong <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Thu, 28 Feb 2019 22:12:17 +0000
Gerrit-HasComments: Yes

[Impala-ASF-CR] IMPALA-5843: Use page index in Parquet files to skip pages

Reply via email to