Re: [I] [C++][Parquet] Timestamp conversion from Parquet to Arrow does not follow compatibility guidelines for convertedType [arrow]

via GitHub Wed, 10 Jan 2024 03:31:01 -0800


joellubi commented on issue #39489:
URL: https://github.com/apache/arrow/issues/39489#issuecomment-1884672151


   Thank you for adding those details, @jorisvandenbossche; they really helped 
clarify the context of the previous decision. Thinking about this, it seems 
there are actually two separate but related goals for the conversion:
   
   1. Preservation of the original Parquet semantics when converting to Arrow
   2. Preservation of the original Arrow type when a roundtrip occurs through 
Parquet and back to Arrow
   
   Since the legacy Parquet types cannot distinguish between instant and local 
semantics, consumers have had to make an arbitrary choice between the two, as 
you described. The Parquet spec now takes the explicit position that the 
deprecated convertedType timestamps always had instant semantics and that 
local semantics were unsupported 
([source](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md#deprecated-timestamp-convertedtype)).
This makes the current position clear, but it doesn't change the reality that 
some real-world Parquet usage has not followed these semantics.
   
   When a Parquet file is being read and we don't know (or care) how it was 
produced, I think it makes sense to follow the compatibility guidelines (as 
the recently merged PR does). This does break roundtrip behavior (thank you 
for pointing this out), but I think Arrow already has an independent solution 
for that case. The metadata stored under the `ARROW:schema` key _should_ be 
able to preserve this information on roundtrip, and I was actually surprised 
that it didn't in your example. I found 
[this](https://github.com/apache/arrow/blob/72ed58449ea71aab1343d9adce19f177f20705cf/cpp/src/parquet/arrow/schema.cc#L922-L933)
relevant code for how we currently restore the original type. It seems that we 
restore a particular timezone if one was present in the original schema, but 
_do not_ remove the timezone if none was present. Perhaps by independently 
updating this logic we can get the desired behavior for both scenarios.
   
   Desired behavior IMO:
   ```python
   >>> pq.read_metadata("test_timestamp_convertedtype_writer_unknown.parquet").schema.to_arrow_schema()
   col: timestamp[us, tz=UTC]
   
   >>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_no_tz.parquet").schema.to_arrow_schema()
   col: timestamp[us]
   
   >>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_with_tz.parquet").schema.to_arrow_schema()
   col: timestamp[us, tz=UTC]
   ```
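
   To make the three cases above concrete, here is a minimal sketch of the 
decision rule I have in mind. It is only a model of the proposal, not the 
actual logic in `schema.cc`: `TimestampType`, `from_converted_type`, and 
`restore_original` are hypothetical names, and the stored `ARROW:schema` type 
is represented simply as an optional argument.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimestampType:
    """Hypothetical stand-in for an Arrow timestamp type (not pyarrow's class)."""
    unit: str
    tz: Optional[str] = None

    def __str__(self) -> str:
        if self.tz:
            return f"timestamp[{self.unit}, tz={self.tz}]"
        return f"timestamp[{self.unit}]"


def from_converted_type(unit: str) -> TimestampType:
    # Assumption taken from the spec position discussed above: legacy
    # convertedType timestamps are instants, so they are read as UTC.
    return TimestampType(unit, tz="UTC")


def restore_original(inferred: TimestampType,
                     original: Optional[TimestampType]) -> TimestampType:
    # No stored Arrow schema: the writer is unknown, so keep the type
    # implied by the Parquet metadata alone.
    if original is None:
        return inferred
    # Stored Arrow schema present: take the timezone from it, including
    # the "no timezone" case, which is the part that is not handled today.
    return TimestampType(inferred.unit, tz=original.tz)


inferred = from_converted_type("us")
print(restore_original(inferred, None))                        # writer unknown
print(restore_original(inferred, TimestampType("us")))         # Arrow, no tz
print(restore_original(inferred, TimestampType("us", "UTC")))  # Arrow, with tz
```

   The sketch keeps the inferred unit and only overrides the timezone, so it 
changes nothing for files that carry no `ARROW:schema` metadata.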

