Max Burke created ARROW-11269:
---------------------------------
Summary: [Rust] Unable to read Parquet file because of mismatch
Key: ARROW-11269
URL: https://issues.apache.org/jira/browse/ARROW-11269
Project: Apache Arrow
Issue Type: Bug
Components: Rust
Affects Versions: 3.0.0
Reporter: Max Burke
Attachments: 0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet
The issue seems to stem from the new(-ish) behavior of the Arrow Parquet reader,
where the embedded Arrow schema is used instead of deriving the schema from the
Parquet columns.
However, it seems that some code paths still derive the column type from the
Parquet column types, leading the Arrow record batch reader to error out because
the column types do not match the schema types.
In our case, the column type is an int96 datetime (ns) type, and the Arrow type
in the embedded schema is DataType::Timestamp(TimeUnit::Nanosecond,
Some("UTC")). However, the code that constructs the arrays seems to re-derive
this column type as DataType::Timestamp(TimeUnit::Nanosecond, None) (because
the Parquet schema has no timezone information). As a result, Parquet files that
we were able to read successfully with our branch of Arrow circa October are now
unreadable.
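
For reference, a minimal sketch of how we hit the error, assuming the parquet
3.0.0 Arrow reader API (the file name refers to the attached sample; the exact
error message may differ):

use std::fs::File;
use std::sync::Arc;

use parquet::arrow::{ArrowReader, ParquetFileArrowReader};
use parquet::file::reader::SerializedFileReader;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Attached sample file.
    let file = File::open("0100c937-7c1c-78c4-1f4b-156ef04e79f0.parquet")?;
    let file_reader = SerializedFileReader::new(file)?;
    let mut arrow_reader = ParquetFileArrowReader::new(Arc::new(file_reader));

    // The embedded Arrow schema reports Timestamp(Nanosecond, Some("UTC"))
    // for the datetime column.
    println!("{:?}", arrow_reader.get_schema()?);

    // Building the batches fails because the int96 column is materialized as
    // Timestamp(Nanosecond, None), which does not match the schema above.
    let record_batch_reader = arrow_reader.get_record_reader(1024)?;
    for batch in record_batch_reader {
        println!("read {} rows", batch?.num_rows());
    }
    Ok(())
}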
I've attached an example of a Parquet file that demonstrates the problem. This
file was created in Python (as most of our Parquet files are).