Prakhar Jain resolved PARQUET-2117.
    Fix Version/s: 1.12.3
                       (was: 1.13.0)
       Resolution: Fixed

> Add rowPosition API in parquet record readers
> ---------------------------------------------
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.12.3
> Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to
> read a Parquet file in columnar fashion or record by record.
> It would be great to extend them to also support a rowPosition API that
> reports the position of the current record in the Parquet file.
> The rowPosition can be used as a unique row identifier to mark a row. This
> is useful for building an index (e.g. a B+ tree) over a Parquet file or a
> Parquet table (e.g. in Spark/Hive).
> There are multiple projects in the Parquet ecosystem that can benefit from
> such functionality:
>  # Apache Iceberg needs this functionality. It already implements it by
> relying on low-level Parquet APIs:
> [Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
> [Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
>  # Apache Spark can use this functionality - SPARK-37980
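
The issue text does not spell out how a row position is computed, so the following is only a minimal sketch of the underlying arithmetic, not the parquet-mr implementation. It assumes that a record's position in the file is the number of rows in all preceding row groups plus the record's index within its own row group. The class and method names (RowPositionSketch, rowGroupOffsets, rowPosition) are hypothetical and do not come from parquet-mr.

```java
// Hypothetical sketch: none of these names are part of parquet-mr.
final class RowPositionSketch {

    // Start position of each row group: the total number of rows
    // stored in all row groups that precede it in the file.
    static long[] rowGroupOffsets(long[] rowCounts) {
        long[] offsets = new long[rowCounts.length];
        long running = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            offsets[i] = running;
            running += rowCounts[i];
        }
        return offsets;
    }

    // File-level position of a record, given its row group and its
    // zero-based index inside that row group.
    static long rowPosition(long[] offsets, int rowGroup, long indexInGroup) {
        return offsets[rowGroup] + indexInGroup;
    }

    public static void main(String[] args) {
        // Assumed example file: three row groups with 3, 2 and 4 rows.
        long[] rowCounts = {3, 2, 4};
        long[] offsets = rowGroupOffsets(rowCounts);

        // Pretend a filter skipped row group 1 entirely; positions in
        // row group 2 are still computed from the full file layout.
        for (int g : new int[] {0, 2}) {
            for (long i = 0; i < rowCounts[g]; i++) {
                System.out.println(
                    "rowGroup=" + g + " rowPosition=" + rowPosition(offsets, g, i));
            }
        }
    }
}
```

Deriving the position from per-row-group offsets rather than from a running counter keeps it stable when row groups are skipped, which is what makes it usable as a row identifier for an index.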

This message was sent by Atlassian Jira