[ https://issues.apache.org/jira/browse/PARQUET-1364?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Gabor Szadovszky updated PARQUET-1364:
--------------------------------------
Fix Version/s: (was: 1.11.0)
> Column Indexes: Invalid row indexes for pages starting with nulls
> -----------------------------------------------------------------
>
> Key: PARQUET-1364
> URL: https://issues.apache.org/jira/browse/PARQUET-1364
> Project: Parquet
> Issue Type: Sub-task
> Reporter: Gabor Szadovszky
> Assignee: Gabor Szadovszky
> Priority: Major
> Labels: pull-request-available
>
> The current implementation for managing row indexes while writing pages is
> not reliable. There is logic in
> [MessageColumnIO|https://github.com/apache/parquet-mr/blob/master/parquet-column/src/main/java/org/apache/parquet/io/MessageColumnIO.java#L153]
> that caches null values and flushes them just *before* opening a new group.
> As a result, a page might start with these cached nulls, which are not
> counted correctly in the written rows, so the rowIndexes are incorrect. This
> does not cause any issues if all the pages are read continuously, but it is a
> huge problem for column index based filtering.
> The implementation described above is really complicated, and I would not like
> to redesign it because of the mentioned issue. It is easier to simply count the
> {{0}} repetition levels as record boundaries at the column writer level.
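
The counting idea in the last sentence can be sketched roughly as follows. This is only an illustration, not the parquet-mr column writer: the class RowBoundaryCounter and its methods are hypothetical names, and the sketch assumes that pages are only cut at record boundaries and that every value, null or not, reaches the writer together with its repetition level.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: derive row indexes from repetition levels only. */
final class RowBoundaryCounter {
    private long rowCount = 0;        // records started so far in the column chunk
    private boolean pageOpen = false; // true once the current page holds a value
    private final List<Long> pageFirstRowIndexes = new ArrayList<>();

    /** Called for every value written to the column, including nulls. */
    void write(int repetitionLevel) {
        if (!pageOpen) {
            // Rows counted before the first value of the page give the
            // index of the first row in that page.
            pageFirstRowIndexes.add(rowCount);
            pageOpen = true;
        }
        if (repetitionLevel == 0) {
            // A repetition level of 0 marks the start of a new record.
            rowCount++;
        }
    }

    /** Called when the current page is closed. */
    void endPage() {
        pageOpen = false;
    }

    long getRowCount() {
        return rowCount;
    }

    List<Long> getPageFirstRowIndexes() {
        return pageFirstRowIndexes;
    }
}
```

Because the count depends only on the repetition levels seen by the writer, it does not matter whether a value arrived directly or was flushed late from a cache of nulls.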