[jira] [Updated] (SPARK-38094) Parquet: enable matching schema columns by field id

Jackie Zhang (Jira) Tue, 08 Feb 2022 17:52:27 -0800


     [ 
https://issues.apache.org/jira/browse/SPARK-38094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Jackie Zhang updated SPARK-38094:
---------------------------------
    Description: 
Field ID is a native field in the Parquet schema 
([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398])

After this PR, when the requested schema has field IDs, Parquet readers will 
first use the field ID to determine which Parquet columns to read, before 
falling back to using column names as before. This enables matching columns by 
field ID for supported data warehouses (DWs) such as Iceberg and Delta.

This PR supports:
 * vectorized reader
 * Parquet-mr reader
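
The matching order described above can be illustrated with a minimal Python sketch. It is not Spark's implementation: the `Field` class and `resolve_columns` function are hypothetical names used only for illustration, and the handling of a requested ID that is absent from the file (treated here as a missing column) is an assumption of this sketch rather than something stated in this ticket.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a schema column: a name plus an optional field ID."""
    name: str
    field_id: Optional[int] = None


def resolve_columns(requested: List[Field],
                    file_columns: List[Field]) -> Dict[str, Optional[Field]]:
    """Map each requested column name to the file column it should read.

    Requested fields that carry an ID are matched by ID; fields without an ID
    fall back to matching by column name.
    """
    by_id = {c.field_id: c for c in file_columns if c.field_id is not None}
    by_name = {c.name: c for c in file_columns}

    resolved: Dict[str, Optional[Field]] = {}
    for want in requested:
        if want.field_id is not None:
            # Assumption of this sketch: an ID with no match means "missing column".
            resolved[want.name] = by_id.get(want.field_id)
        else:
            # No ID on the requested field: match by name, as before.
            resolved[want.name] = by_name.get(want.name)
    return resolved


# The file was written with the column called "full_name"; the table later renamed it to "name".
file_columns = [Field("id", 1), Field("full_name", 2), Field("legacy")]
requested = [Field("id", 1), Field("name", 2), Field("legacy"), Field("added_later", 7)]

result = resolve_columns(requested, file_columns)
for requested_name, source in result.items():
    print(requested_name, "<-", source.name if source else None)
```

In this sketch the renamed column `name` still resolves to the file column `full_name` because both carry ID 2, which is the kind of rename that name-only matching cannot follow.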

  was:
Field ID is a native field in the Parquet schema 
([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398])

After this PR, when the requested schema has field IDs, Parquet readers will 
first use the field ID to determine which Parquet columns to read, before 
falling back to using column names as before. This enables matching columns by 
field ID for supported data warehouses (DWs) such as Iceberg and Delta.

This PR supports:
 * vectorized reader

This PR does not support:
 * Parquet-mr reader, due to lack of field ID support (needs a follow-up ticket)


> Parquet: enable matching schema columns by field id
> ---------------------------------------------------
>
>                 Key: SPARK-38094
>                 URL: https://issues.apache.org/jira/browse/SPARK-38094
>             Project: Spark
>          Issue Type: New Feature
>          Components: Spark Core
>    Affects Versions: 3.3.0
>            Reporter: Jackie Zhang
>            Priority: Major
>
> Field ID is a native field in the Parquet schema 
> ([https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L398])
> After this PR, when the requested schema has field IDs, Parquet readers will 
> first use the field ID to determine which Parquet columns to read, before 
> falling back to using column names as before. This enables matching columns by 
> field ID for supported data warehouses (DWs) such as Iceberg and Delta.
> This PR supports:
>  * vectorized reader
>  * Parquet-mr reader



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
