jorisvandenbossche commented on issue #39489:
URL: https://github.com/apache/arrow/issues/39489#issuecomment-1884416205
To make it concrete: assume I have a file written with pyarrow 0.11 (I don't
know exactly at which version we started supporting LogicalType, but I
had this version available, and it still writes only ConvertedType), for
data in Arrow with tz-naive timestamps:
```python
>>> import pyarrow as pa
>>> import pyarrow.parquet as pq
>>> pa.__version__
'0.11.1'
>>> table = pa.Table.from_arrays([pa.array([1, 2, 3], pa.timestamp("us"))], ['col'])
>>> pq.write_table(table, "test_timestamp_pa011.parquet")
```
If we then read it with the latest released pyarrow, we faithfully roundtrip
the Arrow data:
```python
>>> pa.__version__
'14.0.1'
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7ff21be310c0>
required group field_id=-1 schema {
optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema.to_arrow_schema()
col: timestamp[us]
>>> pq.read_table("test_timestamp_pa011.parquet")
pyarrow.Table
col: timestamp[us]
----
col: [[1970-01-01 00:00:00.000001,1970-01-01 00:00:00.000002,1970-01-01 00:00:00.000003]]
```
But now with latest main (after https://github.com/apache/arrow/pull/39491),
the read data is no longer the same as the original data (the type now says
that the data is in UTC, while the original data were local timestamps in an
unknown timezone):
```python
>>> pa.__version__
'15.0.0.dev402+gc752bdb08'
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema
<pyarrow._parquet.ParquetSchema object at 0x7f6969c44c80>
required group field_id=-1 schema {
optional int64 field_id=-1 col (Timestamp(isAdjustedToUTC=true, timeUnit=microseconds, is_from_converted_type=true, force_set_converted_type=false));
}
>>> pq.read_metadata("test_timestamp_pa011.parquet").schema.to_arrow_schema()
col: timestamp[us, tz=UTC]
>>> pq.read_table("test_timestamp_pa011.parquet")
pyarrow.Table
col: timestamp[us, tz=UTC]
----
col: [[1970-01-01 00:00:00.000001Z,1970-01-01 00:00:00.000002Z,1970-01-01 00:00:00.000003Z]]
```
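To illustrate why this change in type matters even though the stored int64 values are identical, here is a minimal sketch using only the Python standard library (not pyarrow). It interprets the same three microsecond values once as tz-naive wall-clock times and once as UTC instants. The variable names and the `+01:00` zone are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

# Raw int64 microsecond values, as stored in the Parquet column above.
values_us = [1, 2, 3]

# Interpretation 1: tz-naive wall-clock times (analogous to timestamp[us]).
naive = [datetime(1970, 1, 1) + timedelta(microseconds=v) for v in values_us]

# Interpretation 2: instants in UTC (analogous to timestamp[us, tz=UTC]).
epoch_utc = datetime(1970, 1, 1, tzinfo=timezone.utc)
aware = [epoch_utc + timedelta(microseconds=v) for v in values_us]

# The wall-clock fields agree once the tzinfo is dropped...
fields_match = all(n == a.replace(tzinfo=None) for n, a in zip(naive, aware))

# ...but a naive datetime never compares equal to an aware one.
values_equal = any(n == a for n, a in zip(naive, aware))

# Converting the UTC instant to another zone (a hypothetical +01:00 offset)
# changes the wall-clock reading while denoting the same instant.
hypothetical_zone = timezone(timedelta(hours=1))
shifted = aware[0].astimezone(hypothetical_zone)

print(fields_match, values_equal, shifted.isoformat())
```

Downstream code that localizes or converts the column would therefore treat the two results differently, even though no stored value changed.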