[GitHub] [arrow-rs] tustvold opened a new issue, #1666: Handling Unsupported Arrow Types in Parquet

GitBox Sat, 07 May 2022 00:22:22 -0700


tustvold opened a new issue, #1666:
URL: https://github.com/apache/arrow-rs/issues/1666

# Problem

Data in parquet is stored as one of a limited number of physical types,
there are then three mechanisms that an arrow reader can use to infer the type
of the data.

1. The deprecated ConvertedType enumeration stored within the parquet schema
2. The
[LogicalType](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
enumeration stored within the parquet schema
3. An embedded arrow schema stored within the parquet file metadata

All parquet readers support 1, all v2 readers support 2, and only arrow
readers support 3.

In some cases the logical type is not necessary to correctly interpret the
data, e.g. String vs LargeString, but in some cases it fundamentally alters the
meaning of the data.

## Timestamp

A nanosecond time units is not supported in 1, and a second time unit is
only natively supported in 3.

Currently this crate encodes second time units as logical type of
milliseconds, this is likely a bug. It also does not convert to nanoseconds to
microseconds when using writer version 1, despite this only being supported in
>2.6.

The python implementation will, depending on the writer version, potentially
cast nanoseconds to microseconds, and seconds to milliseconds - see
[here](https://arrow.apache.org/docs/dev/python/parquet.html?highlight=round%20tripping#storing-timestamps)

There does not appear to be a way to round-trip timezones in 1 or 2. The C++
implementation appears to always normalize timezones to UTC and set
is_adjusted_to_utc to true.

Currently this crate sets is_adjusted_to_utc to true if the timezone is set,
this is despite the writer not actually performing the normalisation. I think
this is a bug.

## Date64

The arrow type Date64 is milliseconds since epoch and does not have an
equivalent ConvertedType nor LogicalType.

Currently this crate converts the type to Date32 on write, losing sub-second
precision. This what the C++ implementation does - see
[here](https://arrow.apache.org/docs/dev/cpp/parquet.html#logical-types).

## Interval

The interval data type has a ConvertedType but not a LogicalType, there is a
PR to add LogicalType support https://github.com/apache/parquet-format/pull/165
but it appears to have stalled somewhat.

An interval of MonthDayNano cannot be represented by the ConvertedType. The
C++ implementation does not appear to support Interval types.

# Proposal

There are broadly speaking 3 ways to handle data types that cannot be
represented in the parquet schema:

* Return an error
* Cast to a native parquet representation, potentially losing precision in
the process
* Encode the data, only encoding its logical type in the embedded arrow
schema

Returning an error is the safest, but is not a great UX. Casting to a native
parquet representation is inline with what other arrow implementations do and
gives the best ecosystem compatibility, but also doesn't make for a great UX,
just search StackOverflow for `allow_truncated_timestamps` to see the confusion
this causes.

The final option is the simplest to implement, the least surprising to
users, and what I would propose implementing. It would break ecosystem
interoperability in certain cases, but I think it is more important that we
faithfully round-trip data than maintain maximal compatibility. Users who then
care about interoperability can explicitly cast data as appropriate, similar to
python's `coerce_timestamps` functionality.

Thoughts @sunchao @nevi-me @alamb @jorisvandenbossche @jorgecarleitao

# Additional Context

* #1459
* #1275
* #1655
* #1663

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-rs] tustvold opened a new issue, #1666: Handling Unsupported Arrow Types in Parquet

Reply via email to