joellubi commented on issue #39489: URL: https://github.com/apache/arrow/issues/39489#issuecomment-1884672151
Thank you for adding those details @jorisvandenbossche, it really helped clarify the context of the previous decision.

In thinking about this, it seems there may actually be two separate but related goals for the conversion:

1. Preservation of the original Parquet semantics when converting to Arrow
2. Preservation of the original Arrow type when a roundtrip occurs through Parquet and back to Arrow

Since the legacy Parquet types cannot distinguish between instant and local semantics, consumers have had to make an arbitrary choice between the two, as you described. It does seem, at least now, that the Parquet spec takes the explicit position that the deprecated convertedType timestamps always had instant semantics, and that local semantics were unsupported ([source](https://github.com/apache/parquet-format/blob/eb4b31c1d64a01088d02a2f9aefc6c17c54cc6fc/LogicalTypes.md#deprecated-timestamp-convertedtype)). This makes the current position clear, but it doesn't change the fact that real-world Parquet usage has not followed these semantics.

When a Parquet file is being read and we don't know (or care) how it was produced, I do think it makes sense to follow the compatibility guidelines (as the recently-merged PR does). This does break roundtrip behavior (thank you for pointing this out), but I think Arrow already has an independent solution for this case. The metadata associated with the `ARROW:schema` key _should_ be able to preserve this information on roundtrip, and I was actually surprised that it didn't in your example. I found [this relevant code](https://github.com/apache/arrow/blob/72ed58449ea71aab1343d9adce19f177f20705cf/cpp/src/parquet/arrow/schema.cc#L922-L933) for how we currently restore the original type. It seems that we restore a particular timezone if one was present in the original schema, but _do not_ remove the timezone if one was not present.
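To make the difference concrete, here is a rough pure-Python model of the restoration rule as described above, next to a variant that also drops the timezone when the stored Arrow schema had none. This is only a sketch: `TimestampType`, `restore_current` and `restore_proposed` are hypothetical names for illustration, not Arrow APIs, and the model ignores details of the linked C++ code such as unit mismatches.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimestampType:
    """Hypothetical stand-in for an Arrow timestamp type (not a pyarrow class)."""
    unit: str
    tz: Optional[str] = None


def restore_current(inferred: TimestampType,
                    original: Optional[TimestampType]) -> TimestampType:
    """Model of the behavior described above: a timezone found in the
    stored Arrow schema is restored, but a missing one is never applied."""
    if original is not None and original.tz is not None:
        return TimestampType(inferred.unit, original.tz)
    return inferred


def restore_proposed(inferred: TimestampType,
                     original: Optional[TimestampType]) -> TimestampType:
    """Assumed variant: when a stored Arrow schema exists, always take its
    timezone, including the 'no timezone' case."""
    if original is None:
        # No ARROW:schema metadata (writer unknown): keep the Parquet-derived type.
        return inferred
    return TimestampType(inferred.unit, original.tz)


# Type inferred from the legacy converted type under instant semantics.
inferred = TimestampType("us", "UTC")

writer_unknown = restore_proposed(inferred, None)
arrow_no_tz = restore_proposed(inferred, TimestampType("us"))
arrow_with_tz = restore_proposed(inferred, TimestampType("us", "UTC"))
```

Under this model, only the case where Arrow wrote a timezone-naive column changes: `restore_current` would leave it as `tz="UTC"`, while `restore_proposed` returns a type with no timezone.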
Perhaps by independently updating this logic we can get the desired behavior for both scenarios.

Desired behavior IMO:

```python
>>> pq.read_metadata("test_timestamp_convertedtype_writer_unknown.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]

>>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_no_tz.parquet").schema.to_arrow_schema()
col: timestamp[us]

>>> pq.read_metadata("test_timestamp_convertedtype_written_by_arrow_with_tz.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]
```

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
