tustvold opened a new issue, #1666:
URL: https://github.com/apache/arrow-rs/issues/1666

   # Problem
   
   Data in parquet is stored as one of a limited number of physical types; there are then three mechanisms that an arrow reader can use to infer the type of the data.
   
   1. The deprecated ConvertedType enumeration stored within the parquet schema
   2. The 
[LogicalType](https://github.com/apache/parquet-format/blob/master/LogicalTypes.md)
 enumeration stored within the parquet schema
   3. An embedded arrow schema stored within the parquet file metadata
   
   All parquet readers support 1, all v2 readers support 2, and only arrow 
readers support 3.
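
   For context on mechanism 3, below is a minimal sketch (assuming the parquet crate's serialized file reader) of checking whether a file carries an embedded arrow schema; arrow writers store the serialized schema in the file's key-value metadata under the "ARROW:schema" key.

```rust
use std::fs::File;

use parquet::file::reader::{FileReader, SerializedFileReader};

/// Returns true if the parquet file at `path` contains an embedded arrow
/// schema (mechanism 3), stored under the "ARROW:schema" metadata key.
fn has_embedded_arrow_schema(path: &str) -> parquet::errors::Result<bool> {
    let reader = SerializedFileReader::new(File::open(path)?)?;
    let kv = reader.metadata().file_metadata().key_value_metadata();
    Ok(kv
        .map(|pairs| pairs.iter().any(|pair| pair.key == "ARROW:schema"))
        .unwrap_or(false))
}
```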
   
   In some cases the logical type is not necessary to correctly interpret the data, e.g. String vs LargeString, but in others it fundamentally alters the meaning of the data.
   
   ## Timestamp
   
   A nanosecond time unit is not supported in 1, and a second time unit is only natively supported in 3.
   
   Currently this crate encodes second time units with a logical type of milliseconds; this is likely a bug. It also does not convert nanoseconds to microseconds when using writer version 1, despite nanosecond timestamps only being supported in format versions 2.6 and later.
   
   The python implementation will, depending on the writer version, potentially 
cast nanoseconds to microseconds, and seconds to milliseconds - see 
[here](https://arrow.apache.org/docs/dev/python/parquet.html?highlight=round%20tripping#storing-timestamps)
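
   For illustration, a hedged sketch of what such an explicit, user-side coercion could look like in Rust using arrow's `cast` kernel; this mirrors the python behaviour rather than any existing option in this crate, and the values are made up.

```rust
use std::sync::Arc;

use arrow::array::{Array, ArrayRef, TimestampNanosecondArray};
use arrow::compute::cast;
use arrow::datatypes::{DataType, TimeUnit};

/// Truncate a nanosecond timestamp column to microseconds before writing,
/// analogous to python's `coerce_timestamps="us"`.
fn coerce_to_micros(col: &ArrayRef) -> arrow::error::Result<ArrayRef> {
    cast(col, &DataType::Timestamp(TimeUnit::Microsecond, None))
}

fn main() -> arrow::error::Result<()> {
    // 2022-04-15T05:20:00.123456789Z, as nanoseconds since the epoch
    let nanos: ArrayRef = Arc::new(TimestampNanosecondArray::from(vec![
        1_650_000_000_123_456_789_i64,
    ]));
    let micros = coerce_to_micros(&nanos)?;
    // The trailing 789 nanoseconds are silently dropped by the cast
    assert_eq!(
        micros.data_type(),
        &DataType::Timestamp(TimeUnit::Microsecond, None)
    );
    Ok(())
}
```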
   
   There does not appear to be a way to round-trip timezones in 1 or 2. The C++ 
implementation appears to always normalize timezones to UTC and set 
is_adjusted_to_utc to true.
   
   Currently this crate sets is_adjusted_to_utc to true if the timezone is set, despite the writer not actually performing the normalisation. I think this is a bug.
   
   ## Date64
   
   The arrow type Date64 is milliseconds since the UNIX epoch and has no equivalent ConvertedType or LogicalType.
   
   Currently this crate converts the type to Date32 on write, losing sub-day precision. This is what the C++ implementation does - see [here](https://arrow.apache.org/docs/dev/cpp/parquet.html#logical-types).
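
   To illustrate what truncation loses, a small standalone sketch (values are arbitrary) of the arithmetic involved in narrowing a Date64 to a Date32:

```rust
const MILLIS_PER_DAY: i64 = 86_400_000;

fn main() {
    // A Date64 value: milliseconds since the UNIX epoch, deliberately
    // not a whole number of days.
    let date64_ms: i64 = 1_650_000_000_123;

    // Narrowing to Date32 (days since the UNIX epoch) truncates and
    // drops the time-of-day component entirely.
    let date32_days = date64_ms.div_euclid(MILLIS_PER_DAY);
    let dropped_ms = date64_ms.rem_euclid(MILLIS_PER_DAY);

    assert_eq!(date32_days, 19_097);
    assert_eq!(dropped_ms, 19_200_123); // ~5h20m of precision lost
    assert_eq!(date32_days * MILLIS_PER_DAY + dropped_ms, date64_ms);
}
```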
   
   ## Interval
   
   The interval data type has a ConvertedType but not a LogicalType; there is a PR to add LogicalType support (https://github.com/apache/parquet-format/pull/165) but it appears to have stalled somewhat.
   
   An interval of MonthDayNano cannot be represented by the ConvertedType. The 
C++ implementation does not appear to support Interval types.
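
   For reference, a sketch of the two layouts (illustrative code, not from the crate): the ConvertedType INTERVAL annotates a fixed_len_byte_array(12) holding three little-endian unsigned 32-bit integers (months, days, milliseconds), while arrow's MonthDayNano is 16 bytes with a signed 64-bit nanosecond field that has nowhere to go.

```rust
/// The ConvertedType INTERVAL physical layout: fixed_len_byte_array(12)
/// holding little-endian u32 months, days and milliseconds.
fn encode_converted_interval(months: u32, days: u32, millis: u32) -> [u8; 12] {
    let mut out = [0u8; 12];
    out[0..4].copy_from_slice(&months.to_le_bytes());
    out[4..8].copy_from_slice(&days.to_le_bytes());
    out[8..12].copy_from_slice(&millis.to_le_bytes());
    out
}

/// Arrow's MonthDayNano interval: 16 bytes, with nanosecond resolution
/// and signed fields, neither of which the 12-byte layout can express.
struct MonthDayNano {
    months: i32,
    days: i32,
    nanos: i64,
}

fn main() {
    let bytes = encode_converted_interval(1, 2, 3_000);
    assert_eq!(bytes.len(), 12);

    let mdn = MonthDayNano { months: 1, days: 2, nanos: 3_000_000_123 };
    assert_eq!((mdn.months, mdn.days), (1, 2));
    // Only whole milliseconds survive a cast to the 12-byte layout; the
    // trailing 123 nanoseconds cannot be represented.
    assert_eq!(mdn.nanos / 1_000_000, 3_000);
}
```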
   
   # Proposal
   
   There are, broadly speaking, three ways to handle data types that cannot be represented in the parquet schema (sketched as a hypothetical writer option after the list):
   
   * Return an error
   * Cast to a native parquet representation, potentially losing precision in 
the process
   * Encode the data, only encoding its logical type in the embedded arrow 
schema
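
   As a sketch, these strategies could be surfaced as an explicit writer option; the names below are purely hypothetical and do not exist in the crate.

```rust
/// Hypothetical writer option sketching the three strategies; none of
/// these names exist in this crate today.
pub enum UnsupportedTypeBehavior {
    /// Refuse to write, returning an error to the caller.
    Error,
    /// Cast to the nearest native parquet representation, potentially
    /// losing precision in the process.
    Coerce,
    /// Write the data unchanged, recording its logical type only in the
    /// embedded arrow schema.
    EmbedArrowSchema,
}
```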
   
   Returning an error is the safest, but is not a great UX. Casting to a native parquet representation is in line with what other arrow implementations do and gives the best ecosystem compatibility, but also doesn't make for a great UX; just search StackOverflow for `allow_truncated_timestamps` to see the confusion this causes.
   
   The final option is the simplest to implement, the least surprising to 
users, and what I would propose implementing. It would break ecosystem 
interoperability in certain cases, but I think it is more important that we 
faithfully round-trip data than maintain maximal compatibility. Users who then 
care about interoperability can explicitly cast data as appropriate, similar to 
python's `coerce_timestamps` functionality.
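
   To make the proposal concrete, a sketch of the user-facing write path under option 3, assuming the crate's existing ArrowWriter (file path and values are illustrative): the call site is unchanged, since the writer already embeds the serialized arrow schema in the file metadata; the difference is that the logical type would round-trip through that embedded schema instead of through a lossy cast.

```rust
use std::fs::File;
use std::sync::Arc;

use arrow::array::Date64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("d", DataType::Date64, false)]));
    let batch = RecordBatch::try_new(
        schema.clone(),
        // Milliseconds since the epoch; under the proposal this value would
        // be stored as-is rather than truncated to Date32.
        vec![Arc::new(Date64Array::from(vec![1_650_000_000_123_i64]))],
    )?;

    // ArrowWriter serializes `schema` into the "ARROW:schema" key-value
    // metadata alongside the parquet schema.
    let mut writer = ArrowWriter::try_new(File::create("/tmp/date64.parquet")?, schema, None)?;
    writer.write(&batch)?;
    writer.close()?;
    Ok(())
}
```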
   
   Thoughts @sunchao @nevi-me @alamb @jorisvandenbossche @jorgecarleitao 
   
   # Additional Context
   
   * #1459 
   * #1275
   * #1655 
   * #1663 
   

