[jira] [Updated] (PARQUET-2117) Add rowPosition API in parquet record readers

Prakhar Jain (Jira) Tue, 01 Feb 2022 11:07:16 -0800


     [ 
https://issues.apache.org/jira/browse/PARQUET-2117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Prakhar Jain updated PARQUET-2117:
----------------------------------
    Description: 
Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read 
a Parquet file in columnar fashion or record by record.

It would be great to extend them to also support a rowPosition API that 
reports the position of the current record in the Parquet file.

The rowPosition can be used as a unique row identifier to mark a row. This can 
be useful for building an index (e.g. a B+ tree) over a Parquet file or a 
Parquet table (e.g. in Spark/Hive).
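
To make the idea concrete, here is a minimal sketch of how a row position 
could be derived and used as an index key. It assumes the row position is the 
zero-based ordinal of a record within the file, computed from the per-row-group 
row counts stored in the footer. The class and method names are hypothetical 
and are not part of parquet-mr, the row counts and column values are made up, 
and a java.util.TreeMap stands in for a B+ tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class RowPositionSketch {

    // Hypothetical helper: position of the first row of each row group,
    // obtained by accumulating the row counts of the preceding row groups.
    public static long[] rowGroupStartPositions(long[] rowCounts) {
        long[] starts = new long[rowCounts.length];
        long next = 0L;
        for (int i = 0; i < rowCounts.length; i++) {
            starts[i] = next;
            next += rowCounts[i];
        }
        return starts;
    }

    // Hypothetical helper: file-level position of a record, given its
    // row group and its zero-based index inside that row group.
    public static long rowPosition(long[] starts, int rowGroup, long indexInGroup) {
        return starts[rowGroup] + indexInGroup;
    }

    public static void main(String[] args) {
        // Made-up file layout: three row groups with 3, 2 and 4 rows.
        long[] starts = rowGroupStartPositions(new long[] {3, 2, 4});

        // Made-up values of one column, grouped by row group.
        String[][] values = {{"b", "a", "c"}, {"a", "d"}, {"c", "b", "a", "e"}};

        // Sorted index from column value to the row positions holding it.
        // TreeMap is a stand-in here; a real index might be a B+ tree.
        TreeMap<String, List<Long>> index = new TreeMap<>();
        for (int g = 0; g < values.length; g++) {
            for (int i = 0; i < values[g].length; i++) {
                index.computeIfAbsent(values[g][i], k -> new ArrayList<>())
                     .add(rowPosition(starts, g, i));
            }
        }

        // prints {a=[1, 3, 7], b=[0, 6], c=[2, 5], d=[4], e=[8]}
        System.out.println(index);
    }
}
```

Because each position is unique within the file, an index entry can point 
straight at a row without storing any other key.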

Multiple projects in the Parquet ecosystem can benefit from such 
functionality: 
 # Apache Iceberg needs this functionality. It already has its own 
implementation, since it relies on low-level Parquet APIs -  
[Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
 
[Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
 # Apache Spark can use this functionality - SPARK-37980

  was:
Currently the parquet-mr RecordReader/ParquetFileReader exposes APIs to read 
a Parquet file in columnar fashion or record by record.

It would be great to extend them to also support a rowPosition API that 
reports the position of the current record in the Parquet file.

The rowPosition can be used as a unique row identifier to mark a row. This can 
be useful for building an index (e.g. a B+ tree) over a Parquet file or a 
Parquet table (e.g. in Spark/Hive).

Multiple projects in the Parquet ecosystem can benefit from such 
functionality: 
 # Apache Iceberg needs this functionality. It already has its own 
implementation, since it relies on low-level Parquet APIs -  
[Link1|https://github.com/apache/iceberg/blob/apache-iceberg-0.12.1/parquet/src/main/java/org/apache/iceberg/parquet/ReadConf.java#L171],
 
[Link2|https://github.com/apache/iceberg/blob/d4052a73f14b63e1f519aaa722971dc74f8c9796/core/src/main/java/org/apache/iceberg/MetadataColumns.java#L37]
 # Apache Spark wants to expose this as a metadata column - SPARK-37980


> Add rowPosition API in parquet record readers
> ---------------------------------------------
>
>                 Key: PARQUET-2117
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2117
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Prakhar Jain
>            Priority: Major
>             Fix For: 1.13.0



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
