Hello Tamas Mate, [email protected], Gergely Fürnstáhl, Csaba Ringhofer,
Impala Public Jenkins,
I'd like you to reexamine a change. Please visit
http://gerrit.cloudera.org:8080/19328
to look at the new patch set (#4).
Change subject: IMPALA-11780: Wrong FILE__POSITION values for multi row group
Parquet files when page filtering is used
......................................................................
IMPALA-11780: Wrong FILE__POSITION values for multi row group Parquet files
when page filtering is used
Impala generated wrong values for the FILE__POSITION virtual column
when a Parquet file contained multiple row groups and page filtering
was used.
The Parquet column readers use the value of 'current_row_' to populate
the file position slot. The problem is that 'current_row_' denotes the
index of the row within the row group, not within the file. We cannot
change 'current_row_' because page filtering depends on its value: the
page index also uses row group-based row indexes, not file-based ones.
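The relationship between the two indexes can be sketched as follows. This
is only an illustration in Python, not the code of the patch: the names
first_row_indexes, file_position and row_group_row_counts are
hypothetical, and the sketch assumes that the file position of a row is
the number of rows in all preceding row groups plus its row group-based
index.

```python
from itertools import accumulate


def first_row_indexes(row_group_row_counts):
    """Return the file-level index of the first row of each row group.

    This is an exclusive prefix sum over the per-row-group row counts.
    """
    return [0] + list(accumulate(row_group_row_counts))[:-1]


def file_position(row_group_row_counts, row_group_idx, current_row):
    """Translate a row group-based row index into a file-level position.

    'current_row' plays the role of 'current_row_': it is relative to the
    row group and is left untouched, so page filtering could keep using it.
    """
    if not 0 <= current_row < row_group_row_counts[row_group_idx]:
        raise ValueError("current_row is outside of the row group")
    return first_row_indexes(row_group_row_counts)[row_group_idx] + current_row


if __name__ == "__main__":
    counts = [100, 100, 50]  # hypothetical file with three row groups
    # Row 7 of the second row group is row 107 of the file.
    print(file_position(counts, 1, 7))
```

With a single row group the offset is zero, so both indexes coincide, which
is consistent with the bug only showing up in multi row group files.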
While working on this, it also turned out that FILE__POSITION was not
set correctly in the Parquet late materialization code, because
BaseScalarColumnReader::SkipRowsInternal() didn't update 'current_row_'
in some code paths.
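The invariant behind this fix can be illustrated with a toy reader. This
is a hypothetical Python model, not BaseScalarColumnReader: the class
ToyColumnReader, its fields, and the starting value of -1 are assumptions
made for the example. The point is that every path that consumes rows,
including the one that crosses a page boundary, must advance the row
counter.

```python
class ToyColumnReader:
    """Toy reader that tracks the row group-based index of the last row."""

    def __init__(self, page_row_counts):
        self.page_row_counts = page_row_counts
        self.page_idx = 0
        self.rows_left_in_page = page_row_counts[0] if page_row_counts else 0
        self.current_row = -1  # assumption: no row has been consumed yet

    def skip_rows(self, num_rows):
        """Skip 'num_rows' rows; return False if the row group runs out."""
        while num_rows > 0:
            if self.rows_left_in_page == 0:
                # Page exhausted: move to the next page, if there is one.
                if self.page_idx + 1 >= len(self.page_row_counts):
                    return False
                self.page_idx += 1
                self.rows_left_in_page = self.page_row_counts[self.page_idx]
                continue
            step = min(num_rows, self.rows_left_in_page)
            self.rows_left_in_page -= step
            num_rows -= step
            # Single place where rows are consumed, so the counter cannot
            # be forgotten on the page-crossing path.
            self.current_row += step
        return True


if __name__ == "__main__":
    reader = ToyColumnReader([3, 4, 2])
    reader.skip_rows(5)  # crosses from the first page into the second
    print(reader.current_row, reader.page_idx)
```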
The value of FILE__POSITION is critical for Iceberg V2 tables as
position delete files store file positions of the deleted rows.
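To show why a wrong position matters, here is a minimal sketch of how
position deletes are applied. It is not Impala's implementation; the
function apply_position_deletes and its arguments are hypothetical. It
only relies on the fact that an Iceberg position delete record identifies
a row by the data file path and the position of the row in that file.

```python
def apply_position_deletes(data_file_path, rows_with_positions, position_deletes):
    """Drop the rows of one data file that are listed in position deletes.

    rows_with_positions: iterable of (file_position, row) pairs
    position_deletes:    iterable of (file_path, pos) pairs
    """
    deleted = {pos for path, pos in position_deletes if path == data_file_path}
    return [row for pos, row in rows_with_positions if pos not in deleted]


if __name__ == "__main__":
    deletes = [("a.parquet", 1), ("b.parquet", 0), ("a.parquet", 101)]
    rows = [(0, "r0"), (1, "r1"), (100, "r100"), (101, "r101")]
    print(apply_position_deletes("a.parquet", rows, deletes))
```

If the scanner reported row group-based positions instead, a row in a later
row group could carry the position of a row from an earlier one, and the
wrong row would be filtered out.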
Testing:
* added e2e tests
* the tests now also run without PARQUET_READ_STATISTICS to exercise
more code paths
Change-Id: I5ef37a1aa731eb54930d6689621cd6169fed6605
(cherry picked from commit b71a18bc82629c71aba8d5a55fe91fb04c975ae1)
---
M be/src/exec/parquet/parquet-column-readers.cc
M be/src/exec/parquet/parquet-column-readers.h
M testdata/workloads/functional-query/queries/QueryTest/virtual-column-file-position-parquet.test
M tests/query_test/test_scanners.py
M tests/util/get_parquet_metadata.py
5 files changed, 129 insertions(+), 14 deletions(-)
git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/28/19328/4
--
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newpatchset
Gerrit-Change-Id: I5ef37a1aa731eb54930d6689621cd6169fed6605
Gerrit-Change-Number: 19328
Gerrit-PatchSet: 4
Gerrit-Owner: Zoltan Borok-Nagy <[email protected]>
Gerrit-Reviewer: Anonymous Coward <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Gergely Fürnstáhl <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Tamas Mate <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>