[Impala-ASF-CR] IMPALA-6433: Add read support for PageHeaderV2

Impala Public Jenkins (Code Review) Fri, 12 May 2023 11:33:05 -0700

Impala Public Jenkins has submitted this change and it was merged. ( 
http://gerrit.cloudera.org:8080/19793 )


Change subject: IMPALA-6433: Add read support for PageHeaderV2
......................................................................

IMPALA-6433: Add read support for PageHeaderV2

Parquet v2 means several changes in Parquet files compared to v1:

1. file version = 2 instead of 1

https://github.com/apache/parquet-format/blob/c185faf0c4fc0c7d3075d1abd4ed0985cdbe5d87/src/main/thrift/parquet.thrift#L1016
Before this patch Impala rejected Parquet files with version!=1.

2. possible use of DataPageHeaderV2 instead DataPageHeader

https://github.com/apache/parquet-format/blob/c185faf0c4fc0c7d3075d1abd4ed0985cdbe5d87/src/main/thrift/parquet.thrift#L561

The main differences compared to V1 DataPageHeader:
a. rep/def levels are not compressed, so the compressed part contains
   only the actual encoded values
b. rep/def levels must be RLE encoded (Impala only supports RLE encoded
   levels even for V1 pages)
c. compression can be turned on/off per page (member is_compressed)
d. number of nulls (member num_nulls) is required - in v1 it was
   included in statistics which is optional
e. number of rows is required (member num_rows) which can help with
   matching collection items with the top level collection

The patch adds support for understanding v2 data pages but does not
implement some potential optimizations:

a. would allow an optimization for queries that need only the nullness
of a column but not the actual value: as the values are not needed the
decompression of the page data can be skipped. This optimization is not
implemented - currently Impala materializes both the null bit and the
value for all columns regardless of whether the value is actually
needed.

d. could be also used for optimizations / additional validity checks
but it is not used currently

e. could make skipping rows easier but is not used, as the existing
scanner has to be able to skip rows efficiently also in v1 files so
it can't rely on num_rows

3. possible use of new encodings (e.g. DELTA_BINARY_PACKED)

No new encoding is added - when an unsupported encoding is encountered
Impala returns an error.

parquet-mr uses new encodings (DELTA_BINARY_PACKED, DELTA_BYTE_ARRAY)
for most types if the file version is 2, so with this patch Impala is
not yet able to read all v2 Parquet tables written by Hive.

4. Encoding PLAIN_DICTIONARY is deprecated and RLE_DICTIONARY is used
instead. The semantics of the two encodings are exactly the same.

Additional changes:
Some responsibilites are moved from ParquetColumnReader to
ParquetColumnChunkReader:
- ParquetColumnChunkReader decodes rep/def level sizes to hide v1/v2
  differences (see 2.a.)
- ParquetColumnChunkReader skips empty data pages in
  ReadNextDataPageHeader
- the state machine of ParquetColumnChunkReader is simplified by
  separating data page header reading / reading rest of the page

Testing:
- added 4 v2 Parquet test tables (written by Hive) to cover
  compressed / uncompressed and scalar/complex cases
- added EE and fuzz tests for the test tables above
- manual tested v2 Parquet files written by pyarrow
- ran core tests

Note that no test is added where some pages are compressed while
some are not. It would be tricky to create such files with existing
writers. The code should handle this case and it is very unlikely that
files like this will be encountered.

Change-Id: I282962a6e4611e2b662c04a81592af83ecaf08ca
Reviewed-on: http://gerrit.cloudera.org:8080/19793
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M be/src/exec/parquet/hdfs-parquet-table-writer.cc
M be/src/exec/parquet/parquet-column-chunk-reader.cc
M be/src/exec/parquet/parquet-column-chunk-reader.h
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M be/src/exec/parquet/parquet-common.h
M be/src/exec/parquet/parquet-level-decoder.cc
M be/src/exec/parquet/parquet-level-decoder.h
M be/src/exec/parquet/parquet-metadata-utils.cc
M testdata/datasets/functional/functional_schema_template.sql
M testdata/datasets/functional/schema_constraints.csv
A testdata/workloads/functional-query/queries/QueryTest/parquet-v2.test
M tests/query_test/test_scanners.py
M tests/query_test/test_scanners_fuzz.py
14 files changed, 566 insertions(+), 187 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/19793
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I282962a6e4611e2b662c04a81592af83ecaf08ca
Gerrit-Change-Number: 19793
Gerrit-PatchSet: 18
Gerrit-Owner: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Daniel Becker <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>

[Impala-ASF-CR] IMPALA-6433: Add read support for PageHeaderV2

Reply via email to