[
https://issues.apache.org/jira/browse/ARROW-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andy Grove updated ARROW-11269:
-------------------------------
Fix Version/s: 3.0.1
> [Rust] Unable to read Parquet file because of mismatch in column-derived and
> embedded schemas
> ---------------------------------------------------------------------------------------------
>
> Key: ARROW-11269
> URL: https://issues.apache.org/jira/browse/ARROW-11269
> Project: Apache Arrow
> Issue Type: Bug
> Components: Rust
> Affects Versions: 3.0.0
> Reporter: Max Burke
> Assignee: Neville Dipale
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 4.0.0, 3.0.1
>
> Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet, main.rs
>
> Time Spent: 1h 50m
> Remaining Estimate: 0h
>
> The issue seems to stem from the new(-ish) behavior of the Arrow Parquet
> reader, where the embedded Arrow schema is used instead of deriving the
> schema from the Parquet columns.
>
> However, it seems that some code paths still derive the schema from the
> column types, leading the Arrow record batch reader to error out because
> the column types must match the schema types.
>
> In our case, the column type is an int96 datetime (ns) type, and the Arrow
> type in the embedded schema is DataType::Timestamp(TimeUnit::Nanosecond,
> Some("UTC")). However, the code that constructs the arrays re-derives this
> column type as DataType::Timestamp(TimeUnit::Nanosecond, None) (because the
> Parquet schema carries no timezone information). As a result, Parquet files
> that we could read successfully with our branch of Arrow circa October are
> now unreadable.
>
> I've attached an example of a Parquet file that demonstrates the problem.
> This file was created in Python (as most of our Parquet files are).
>
> I've also attached a sample Rust program that will demonstrate the error.
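For illustration, a minimal self-contained Rust sketch of why the check fails. It uses simplified stand-ins for arrow's `DataType` and `TimeUnit` (not the real crate types) to show that the embedded type and the column-derived type compare unequal once the timezone is dropped:

```rust
// Simplified stand-ins for arrow's DataType/TimeUnit (hypothetical, not the
// real arrow crate) illustrating the mismatch: the embedded Arrow schema
// carries a timezone, while the type re-derived from the Parquet int96
// column does not, so an exact-equality schema check rejects the batch.
#[derive(Debug, Clone, PartialEq)]
enum TimeUnit {
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    Timestamp(TimeUnit, Option<String>),
}

fn main() {
    // Type recorded in the embedded Arrow schema.
    let embedded = DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".to_string()));
    // Type re-derived from the Parquet column (no timezone available).
    let derived = DataType::Timestamp(TimeUnit::Nanosecond, None);

    // The record batch reader requires the array type to match the schema
    // type exactly, so this comparison fails and reading errors out.
    assert_ne!(embedded, derived);
    println!("schema mismatch: {:?} != {:?}", embedded, derived);
}
```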
--
This message was sent by Atlassian Jira
(v8.3.4#803005)