Prakhar Jain created PARQUET-2117:
-------------------------------------

             Summary: Add rowPosition API in parquet record readers
                 Key: PARQUET-2117
                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
             Project: Parquet
          Issue Type: New Feature
          Components: parquet-mr
            Reporter: Prakhar Jain
             Fix For: 1.13.0


Currently, the parquet-mr RecordReader/ParquetFileReader exposes APIs to read 
a Parquet file in columnar fashion or record by record.

It would be great to extend them to also support a rowPosition API that reports 
the position of the current record in the Parquet file.
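
To make the idea concrete, here is a minimal sketch of how a file-level row 
position could be derived while reading row group by row group. It assumes the 
position is the zero-based ordinal of the record in the file, computed as the 
number of rows in all preceding row groups plus the record's offset within its 
own row group. The class RowPositionSketch and its methods are hypothetical 
names used only for illustration; they are not part of parquet-mr.

```java
// Hypothetical helper (not a parquet-mr class): maps a record's location
// (row group index, offset within that row group) to a file-level position.
final class RowPositionSketch {
    // rowGroupStart[i] = total number of rows in row groups 0..i-1
    private final long[] rowGroupStart;

    // rowGroupRowCounts is assumed to come from the file footer metadata.
    RowPositionSketch(long[] rowGroupRowCounts) {
        rowGroupStart = new long[rowGroupRowCounts.length];
        long running = 0;
        for (int i = 0; i < rowGroupRowCounts.length; i++) {
            rowGroupStart[i] = running;
            running += rowGroupRowCounts[i];
        }
    }

    // Zero-based position of the record within the whole file.
    long rowPosition(int rowGroupIndex, long rowInGroup) {
        return rowGroupStart[rowGroupIndex] + rowInGroup;
    }
}
```

With row groups of 100, 50 and 75 rows, the record at offset 10 in the third 
row group gets position 160. Keeping the per-row-group start offsets means the 
position stays correct even when a reader skips whole row groups.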

The rowPosition can be used as a unique identifier for a row within a file. This 
can be useful for building an index (e.g. a B+ tree) over a Parquet file or a 
Parquet table (e.g. in Spark/Hive).
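
As a rough illustration of the indexing use case, the sketch below maps column 
values to the row positions where they occur. It uses java.util.TreeMap (a 
red-black tree) as a stand-in for a B+ tree, and assumes string keys and a 
single file. RowPositionIndexSketch is a hypothetical class for illustration 
only, not part of parquet-mr.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeMap;

// Hypothetical in-memory index (not a parquet-mr class): sorted map from
// a column value to the row positions in the file that hold that value.
final class RowPositionIndexSketch {
    // TreeMap stands in for a B+ tree; both keep keys in sorted order.
    private final TreeMap<String, List<Long>> index = new TreeMap<>();

    // Record that the row at rowPosition contains the given key.
    void add(String key, long rowPosition) {
        index.computeIfAbsent(key, k -> new ArrayList<>()).add(rowPosition);
    }

    // Row positions for a key, in insertion order; empty if the key is absent.
    List<Long> lookup(String key) {
        return index.getOrDefault(key, Collections.emptyList());
    }
}
```

A lookup then yields row positions that a reader could use to fetch only the 
matching records instead of scanning the whole file.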

There are multiple projects in the Parquet ecosystem that can benefit from 
such functionality: 
 # Apache Iceberg needs this functionality. It already has an implementation, 
because it relies on low-level Parquet APIs: 
[Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171], 
[Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
 # Apache Spark wants to expose this as a metadata column: SPARK-37980



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
