Prakhar Jain created PARQUET-2117:
-------------------------------------
Summary: Add rowPosition API in parquet record readers
Key: PARQUET-2117
URL: https://issues.apache.org/jira/browse/PARQUET-2117
Project: Parquet
Issue Type: New Feature
Components: parquet-mr
Reporter: Prakhar Jain
Fix For: 1.13.0
Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read a
Parquet file in columnar fashion or record by record.
It would be great to extend them to also support a rowPosition API that reports
the position of the current record in the Parquet file.
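As a rough illustration of what such a position could mean, the sketch below derives it from per-row-group row counts: a row's position is the number of rows in all preceding row groups plus its index within its own row group. The class and method names (RowPositions, rowGroupStartPositions, rowPosition) are hypothetical and are not part of parquet-mr; the row counts are made-up inputs standing in for the values a reader would take from the file footer.

```java
public class RowPositions {
    /**
     * Hypothetical helper: given the number of rows in each row group (in file
     * order), returns the file-level position of the first row of each group.
     */
    public static long[] rowGroupStartPositions(long[] rowCounts) {
        long[] starts = new long[rowCounts.length];
        long running = 0;
        for (int i = 0; i < rowCounts.length; i++) {
            starts[i] = running;      // rows before this group
            running += rowCounts[i];  // advance past this group
        }
        return starts;
    }

    /** Hypothetical helper: file-level position of a row inside a row group. */
    public static long rowPosition(long[] starts, int rowGroup, long indexInGroup) {
        return starts[rowGroup] + indexInGroup;
    }

    public static void main(String[] args) {
        // Assumed example: three row groups holding 3, 2 and 4 rows.
        long[] starts = rowGroupStartPositions(new long[] {3, 2, 4});
        // Second row (index 1) of the third row group: 3 + 2 + 1.
        System.out.println(rowPosition(starts, 2, 1)); // prints 6
    }
}
```

Positions computed this way are zero-based and stay stable even when some row groups are skipped by filtering, because the start offsets come from the row counts of all row groups and not from the rows actually read.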
The rowPosition can be used as a unique row identifier to mark a row. This can
be useful to create an index (e.g. B+ tree) over a parquet file/parquet table
(e.g. Spark/Hive).
Several projects in the Parquet ecosystem can benefit from this
functionality:
# Apache Iceberg needs this functionality and already implements it itself,
since it relies on low-level Parquet APIs:
[Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
[Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
# Apache Spark wants to expose this as a metadata column - SPARK-37980