Zoltan Borok-Nagy has posted comments on this change. ( http://gerrit.cloudera.org:8080/12065 )
Change subject: IMPALA-5843: Use page index in Parquet files to skip pages
......................................................................


Patch Set 6:

(30 comments)

http://gerrit.cloudera.org:8080/#/c/12065/5//COMMIT_MSG
Commit Message:

http://gerrit.cloudera.org:8080/#/c/12065/5//COMMIT_MSG@30
PS5, Line 30: Testing
> I am curious: did you run FE/EE tests on data that was loaded with Parquet
I needed to update the stats extrapolation tests.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/hdfs-scan-node-base.cc
File be/src/exec/hdfs-scan-node-base.cc:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/hdfs-scan-node-base.cc@610
PS5, Line 610: buffer_opts, move(sub_ranges), metadata);
> nit: indentation
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h
File be/src/exec/parquet/hdfs-parquet-scanner.h:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@324
PS5, Line 324: /// the ScannerContext.
> It would be very nice to extend this documentation with some high level inf
Added some comments.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@413
PS5, Line 413: ParquetFileVersion file_version_;
> It could be mentioned that empty candidate_ranges_ means that there is no p
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@462
PS5, Line 462:
             : /// Number of columns that need to be read.
> The comment is a bit confusing: pages dropped by stats filtering do not "ha
I used comments similar to the ones we use for the row group counters. Anyway, I modified and extended the comment.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@464
PS5, Line 464: RuntimeProfile::Counter* num_cols_counter_;
> It could also be useful to track partially filtered pages. This could
These counters could be useful for debugging, but in my opinion they would be confusing and too low-level in the profile.
I think the predicates in the query string and the existing counters above give us enough information to validate page selection.

I added a new counter for row groups that are filtered by the page index.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@512
PS5, Line 512:
> nit: initializing?
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.h@523
PS5, Line 523: Recov
> nit: agree
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc
File be/src/exec/parquet/hdfs-parquet-scanner.cc:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@810
PS5, Line 810: oup& row_group = f
> This is the only place I found where this counter is incremented, which mea
It is also incremented in parquet-column-readers.cc::BaseScalarColumnReader::ReadDataPage().


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1816
PS5, Line 1816:
> expected_rows_in_group could become a member and the effect of page filteri
We can calculate the expected row count here as well. Added a new nested if statement.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/hdfs-parquet-scanner.cc@1846
PS5, Line 1846: }
              :   }
              : }
              :
              : // Validate scalar column readers'
> Is it possible to "revive" these tests in the page filtered world?
Currently 'reader->num_values_read_' is the sum of 'num_values' from the Parquet pages. We could subtract from it whenever we skip values, but that would make the column readers a bit more complex. I have added some extra validation in CheckPageFiltering() for the case when we use the page index.
http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.h
File be/src/exec/parquet/parquet-column-readers.h:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.h@486
PS5, Line 486: values' or up to value
> current_top_level_row_ was renamed to current_row_. Note that I prefer the
I didn't want to spell out "top_level" every time, and it was confusing to have both top-level rows and rows in the code. Now "row" means the top-level rows and "value" means the "primitive" values.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc
File be/src/exec/parquet/parquet-column-readers.cc:

http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@454
PS5, Line 454: while
> I think that this loop is way too large and is ripe for refactor. My idea i
Will think about it. Currently I'm not sure that substituting those lines with a function call that takes a lot of parameters would improve readability significantly.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@458
PS5, Line 458: if (num_buffered_values_ == 0) {
> Mentioning that NextPage() skips rows if the page starts within a filtered
Extended the comment above.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@466
PS5, Line 466: // count from page metadata to return the correct number of rows.
             : if (!MATERIALIZED && !IN_COLLECTION) {
             :   // We cannot filter pages in this context.
             :   DCHECK(candidate_data_pages_.empty());
             :   int vals_to_add = min(num_buffered_values_, max_values - val_count);
             :   val_count += vals_to_add;
             :   num_buffered_values_ -= vals_to_add;
             :
> This block doesn't seem to respect page skipping.
> Is this valid because if
Added comment and DCHECK.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@478
PS5, Line 478: re about the nesting structure unless the position slot is being populated,
> Do we have to care about rep levels if we are not IN_COLLECTION and there a
Updated the condition.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@544
PS5, Line 544: DCHECK_GT(num_buffered_values_, 0);
> Same as at line 478.
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@553
PS5, Line 553: max_values = min(max_values, num_buffered_values_);
             : while (def_levels_.CacheHasNext() && val_count < max_values) {
> IN_COLLECTION could be moved to the 'if', as the block seems a NOOP if it i
It's not a no-op in that case: 'peek_rep_level' will be 0, which is part of the condition in the following line.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@599
PS5, Line 599: }
             :
             : //
> This seems weird if we are not in a collection.
We don't have the IN_COLLECTION template parameter here, but I added 'max_rep_level_ > 0'.


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@629
PS5, Line 629: }
             : } else {
             :   // Cannot consume more levels than allowed by buffered input values and output space.
             :   num_def_levels_to_consume = min
> These two comments seem to say the same thing.
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/be/src/exec/parquet/parquet-column-readers.cc@1105
PS5, Line 1105: }
              : else if (data_start < data_start_based_on_offset_index) {
              :   // 'dictionary_page_offset' is not set, but the offset index and
              :   // column chunk metadata disagree on the data start => column chunk's data start
              :   // is actually the location of the dictionary page.
              :   int64_t dict_start = data_start;
> Is there a tool that creates such Parquet files?
Parquet-MR behaves like that.
It doesn't populate 'dictionary_page_offset', but sets 'data_page_offset' to the offset of the dictionary page.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README
File testdata/data/README:

http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@301
PS5, Line 301: decimals_1_10.parquet:
> nit: most file names are followed by ":" (the last 4 are exceptions)
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@302
PS5, Line 302: Contains two decimal columns, one with precision 1, the other with precision 10.
             : I used Hive 2.1.1 on top of Parquet-MR (6901a20) to create tiny, misaligned pages in
             : order to test the value-skipping logic in the Parquet column readers.
> The other copies of this sentence could just point here. The blocks of copy
Done


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@305
PS5, Line 305: I slightly modified Parquet-MR (MIN_SLAB_SIZE = 1).
> It could be useful to upload it to private github and add a link here.
I used snapshot versions of Hive and Parquet-MR, and you'd need to use matching versions of those, otherwise you'll likely get some exceptions. I think the description of the modification is enough for people to reproduce it.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@309
PS5, Line 309: (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
> I didn't spot any tests where the page filtering eliminated all of the page
Actually, I didn't handle that case; I modified the code a bit and added a test.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@319
PS5, Line 319: (1, 1), (2, 2), (3, 3), (4, 4), (5, NULL);
> I didn't notice any all-null page in the data. It would be useful to create
Actually, the NULL in this line belongs to a null page because it's in the second column (d_10).


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/data/README@334
PS5, Line 334: select 2, array(cast (2 as decimal(1,0)), cast (2 as decimal(1,0)) ) union all
> Do the empty lines have a meaning, e.g.
> do new pages start here for one
No, they just separate pages that have different structures.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test
File testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test:

http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@3
PS5, Line 3: select *
> While it is useful to have a few tests that show the whole output, most tes
I'd rather keep the data to check that the decoders and the skipping logic don't drift somewhere.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@4
PS5, Line 4: ---- RESULTS
> Some results could be crosschecked with the same query when READ_PARQUET_PA
At first I generated the results with READ_PARQUET_PAGE_INDEX=false, then validated them with READ_PARQUET_PAGE_INDEX=true. After that I added the "RUNTIME_PROFILE" sections and regenerated the results with READ_PARQUET_PAGE_INDEX=true.


http://gerrit.cloudera.org:8080/#/c/12065/5/testdata/workloads/functional-query/queries/QueryTest/parquet-page-index-alltypes-tiny-pages.test@95
PS5, Line 95: select * from alltypes_tiny_pages where bigint_col = 0 and month = 7 and year = 2010
> It would be nice to have a test that demonstrates that column indexes for d
There are tests with multiple predicates on different columns. I chose those predicates so that together they filter out more rows.
--
To view, visit http://gerrit.cloudera.org:8080/12065
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I0cc99f129f2048dbafbe7f5a51d1ea3a5005731a
Gerrit-Change-Number: 12065
Gerrit-PatchSet: 6
Gerrit-Owner: Zoltan Borok-Nagy <borokna...@cloudera.com>
Gerrit-Reviewer: Csaba Ringhofer <csringho...@cloudera.com>
Gerrit-Reviewer: Impala Public Jenkins <impala-public-jenk...@cloudera.com>
Gerrit-Reviewer: Lars Volker <l...@cloudera.com>
Gerrit-Reviewer: Michael Ho <k...@cloudera.com>
Gerrit-Reviewer: Pooja Nilangekar <pooja.nilange...@cloudera.com>
Gerrit-Reviewer: Tim Armstrong <tarmstr...@cloudera.com>
Gerrit-Comment-Date: Thu, 14 Mar 2019 17:01:04 +0000
Gerrit-HasComments: Yes