[ 
https://issues.apache.org/jira/browse/ARROW-11269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Max Burke updated ARROW-11269:
------------------------------
    Attachment: main.rs

> [Rust] Unable to read Parquet file because of mismatch in column-derived and 
> embedded schemas
> ---------------------------------------------------------------------------------------------
>
>                 Key: ARROW-11269
>                 URL: https://issues.apache.org/jira/browse/ARROW-11269
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust
>    Affects Versions: 3.0.0
>            Reporter: Max Burke
>            Priority: Blocker
>         Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet, main.rs
>
>
> The issue seems to stem from the new(-ish) behavior of the Arrow Parquet 
> reader where the embedded arrow schema is used instead of deriving the schema 
> from the Parquet columns.
>  
> However, some code paths still seem to derive the schema type from the column 
> types, so the Arrow record batch reader fails with an error that the column 
> types must match the schema types.
>  
> In our case, the Parquet column is an INT96 datetime (ns), and the Arrow 
> type in the embedded schema is DataType::Timestamp(TimeUnit::Nanosecond, 
> Some("UTC")). However, the code that constructs the arrays seems to re-derive 
> the column type as DataType::Timestamp(TimeUnit::Nanosecond, None), because 
> the Parquet schema has no timezone information. As a result, Parquet files 
> that we could read successfully with our branch of Arrow circa October are 
> now unreadable.
>  
> I've attached an example of a Parquet file that demonstrates the problem. 
> This file was created in Python (as most of our Parquet files are).
> I've also attached a sample Rust program that will demonstrate the error.
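
The mismatch described above can be sketched without the arrow crate. The
block below is only an illustration under stated assumptions: TimeUnit and
DataType here are simplified stand-ins for the arrow types of the same
name, and embedded_type, derived_type, and check_strict are hypothetical
helpers, not functions from the Parquet reader or from the attached main.rs.

```rust
// Simplified stand-ins for arrow's `TimeUnit` and `DataType`.
// Only the variants needed for this illustration are modelled.
#[allow(dead_code)]
#[derive(Debug, Clone, PartialEq)]
enum TimeUnit {
    Microsecond,
    Nanosecond,
}

#[derive(Debug, Clone, PartialEq)]
enum DataType {
    // (unit, optional timezone name)
    Timestamp(TimeUnit, Option<String>),
}

/// Type as recorded in the Arrow schema embedded in the file metadata.
fn embedded_type() -> DataType {
    DataType::Timestamp(TimeUnit::Nanosecond, Some("UTC".to_string()))
}

/// Type as re-derived from the INT96 column, which carries no timezone.
fn derived_type() -> DataType {
    DataType::Timestamp(TimeUnit::Nanosecond, None)
}

/// Hypothetical strict check: the column type must equal the schema type.
fn check_strict(schema: &DataType, column: &DataType) -> Result<(), String> {
    if schema == column {
        Ok(())
    } else {
        Err(format!(
            "column type {:?} does not match schema type {:?}",
            column, schema
        ))
    }
}
```

Calling check_strict(&embedded_type(), &derived_type()) returns an Err,
because the two values differ only in the timezone field, which is enough
to make a strict equality comparison fail.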



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
