[ https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Prakhar Jain updated PARQUET-2117:
----------------------------------
Description:
Currently, the parquet-mr RecordReader/ParquetFileReader exposes APIs to read
a Parquet file in columnar fashion or record by record.
It would be great to extend them to also support a rowPosition API that reports
the position of the current record in the Parquet file.
The rowPosition can be used as a unique row identifier to mark a row. This can
be useful for creating an index (e.g. a B+ tree) over a Parquet file or Parquet
table (e.g. in Spark/Hive).
Several projects in the Parquet ecosystem could benefit from
such functionality:
# Apache Iceberg needs this functionality. It already has its own implementation
because it relies on low-level Parquet APIs -
[Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
[Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
# Apache Spark can use this functionality - SPARK-37980
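To make the idea of a row position concrete, here is a minimal sketch. It assumes the position is the zero-based index of a record counted across all row groups in file order, so it can be derived from the per-row-group row counts. The class and method names are hypothetical and are not part of parquet-mr; the row counts are made-up example values.

```java
/** Hypothetical helper for illustration only; not parquet-mr API. */
public class RowPositionSketch {

    /** Returns, for each row group, the file-level index of its first row. */
    public static long[] rowGroupStartPositions(long[] rowCounts) {
        long[] starts = new long[rowCounts.length];
        long running = 0L;
        for (int i = 0; i < rowCounts.length; i++) {
            starts[i] = running;      // rows in earlier groups come first
            running += rowCounts[i];
        }
        return starts;
    }

    /** Maps (row group index, row index within that group) to a file-level position. */
    public static long rowPosition(long[] rowCounts, int rowGroup, long rowInGroup) {
        if (rowGroup < 0 || rowGroup >= rowCounts.length) {
            throw new IllegalArgumentException("No such row group: " + rowGroup);
        }
        if (rowInGroup < 0 || rowInGroup >= rowCounts[rowGroup]) {
            throw new IllegalArgumentException("Row index out of range: " + rowInGroup);
        }
        return rowGroupStartPositions(rowCounts)[rowGroup] + rowInGroup;
    }

    public static void main(String[] args) {
        // Assumed example: three row groups with 100, 250 and 75 rows.
        long[] rowCounts = {100L, 250L, 75L};
        // Row 10 of the second row group sits after the 100 rows of the first.
        System.out.println(rowPosition(rowCounts, 1, 10L)); // prints 110
    }
}
```

Under this assumption the position stays stable for a given file no matter which columns are projected, which is what makes it usable as a row identifier for an index.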
was:
Currently, the parquet-mr RecordReader/ParquetFileReader exposes APIs to read
a Parquet file in columnar fashion or record by record.
It would be great to extend them to also support a rowPosition API that reports
the position of the current record in the Parquet file.
The rowPosition can be used as a unique row identifier to mark a row. This can
be useful for creating an index (e.g. a B+ tree) over a Parquet file or Parquet
table (e.g. in Spark/Hive).
Several projects in the Parquet ecosystem could benefit from
such functionality:
# Apache Iceberg needs this functionality. It already has its own implementation
because it relies on low-level Parquet APIs -
[Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
[Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
# Apache Spark wants to expose this as a metadata column - SPARK-37980
> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
> Key: PARQUET-2117
> URL: https://issues.apache.org/jira/browse/PARQUET-2117
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Prakhar Jain
> Priority: Major
> Fix For: 1.13.0