tustvold opened a new issue, #1666: URL: https://github.com/apache/arrow-rs/issues/1666
# Problem Data in parquet is stored as one of a limited number of physical types, there are then three mechanisms that an arrow reader can use to infer the type of the data. 1. The deprecated ConvertedType enumeration stored within the parquet schema 2. The [LogicalType](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md) enumeration stored within the parquet schema 3. An embedded arrow schema stored within the parquet file metadata All parquet readers support 1, all v2 readers support 2, and only arrow readers support 3. In some cases the logical type is not necessary to correctly interpret the data, e.g. String vs LargeString, but in some cases it fundamentally alters the meaning of the data. ## Timestamp A nanosecond time units is not supported in 1, and a second time unit is only natively supported in 3. Currently this crate encodes second time units as logical type of milliseconds, this is likely a bug. It also does not convert to nanoseconds to microseconds when using writer version 1, despite this only being supported in >2.6. The python implementation will, depending on the writer version, potentially cast nanoseconds to microseconds, and seconds to milliseconds - see [here](https://arrow.apache.org/docs/dev/python/parquet.html?highlight=round%20tripping#storing-timestamps) There does not appear to be a way to round-trip timezones in 1 or 2. The C++ implementation appears to always normalize timezones to UTC and set is_adjusted_to_utc to true. Currently this crate sets is_adjusted_to_utc to true if the timezone is set, this is despite the writer not actually performing the normalisation. I think this is a bug. ## Date64 The arrow type Date64 is milliseconds since epoch and does not have an equivalent ConvertedType nor LogicalType. Currently this crate converts the type to Date32 on write, losing sub-second precision. This what the C++ implementation does - see [here](https://arrow.apache.org/docs/dev/cpp/parquet.html#logical-types). ## Interval The interval data type has a ConvertedType but not a LogicalType, there is a PR to add LogicalType support https://github.com/apache/parquet-format/pull/165 but it appears to have stalled somewhat. An interval of MonthDayNano cannot be represented by the ConvertedType. The C++ implementation does not appear to support Interval types. # Proposal There are broadly speaking 3 ways to handle data types that cannot be represented in the parquet schema: * Return an error * Cast to a native parquet representation, potentially losing precision in the process * Encode the data, only encoding its logical type in the embedded arrow schema Returning an error is the safest, but is not a great UX. Casting to a native parquet representation is inline with what other arrow implementations do and gives the best ecosystem compatibility, but also doesn't make for a great UX, just search StackOverflow for `allow_truncated_timestamps` to see the confusion this causes. The final option is the simplest to implement, the least surprising to users, and what I would propose implementing. It would break ecosystem interoperability in certain cases, but I think it is more important that we faithfully round-trip data than maintain maximal compatibility. Users who then care about interoperability can explicitly cast data as appropriate, similar to python's `coerce_timestamps` functionality. Thoughts @sunchao @nevi-me @alamb @jorisvandenbossche @jorgecarleitao # Additional Context * #1459 * #1275 * #1655 * #1663 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
