paul hess created ARROW-8100:
--------------------------------
Summary: timestamp[ms] and date64 data types not working as
expected on write
Key: ARROW-8100
URL: https://issues.apache.org/jira/browse/ARROW-8100
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Reporter: paul hess
I expect that either timestamp[ms] or date64 will give me a millisecond-precision
datetime/timestamp when written to a parquet file. Instead, this is the
behavior I see:
>>> arr = pa.array([datetime(2020, 12, 20)])
>>> arr.cast(pa.timestamp('ms'), safe=False)
<pyarrow.lib.TimestampArray object at 0x117f3d4c8>
[
  2020-12-20 00:00:00.000
]
>>> table = pa.Table.from_arrays([arr], names=["start_date"])
>>> table
pyarrow.Table
start_date: timestamp[us]
# just to make sure
>>> table.column("start_date").cast(pa.timestamp('ms'), safe=False)
<pyarrow.lib.ChunkedArray object at 0x117f5e9a8>
[
  [
    2020-12-20 00:00:00.000
  ]
]
# just to make extra sure
>>> schema = pa.schema([pa.field("start_date", pa.timestamp("ms"))])
>>> table.cast(schema, safe=False)
>>> parquet.write_table(table, "sldkfjasldkfj.parquet", coerce_timestamps="ms",
...     compression="SNAPPY", allow_truncated_timestamps=True)
Result for the written file:
Schema:
{quote}{
"type" : "record",
"name" : "schema",
"fields" : [ {
"name" : "start_date",
"type" : [ "null", {
"type" : "long",
"logicalType" : "timestamp-millis"
} ],
"default" : null
} ]
}
{quote}
Data:
||start_date|| ||
|1608422400000| |
That is a microsecond [us] value, despite casting to [ms] and setting the
appropriate option on the write_table method. If it were a millisecond timestamp,
it would translate back to a datetime accurately with fromtimestamp, but:
>>> from datetime import datetime
>>> datetime.fromtimestamp(1608422400000)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
ValueError: year 52938 is out of range
>>> datetime.fromtimestamp(1608422400000 / 1000)
datetime.datetime(2020, 12, 19, 16, 0)
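For reference, datetime.fromtimestamp takes seconds since the epoch, so a stored integer has to be scaled by its unit before decoding. A minimal sketch of that arithmetic, assuming the value is decoded in UTC rather than local time (the helper name from_epoch_millis is hypothetical, not part of pyarrow):

```python
from datetime import datetime, timezone

# Integer taken from the "Data" table of the written file above.
STORED_VALUE = 1608422400000


def from_epoch_millis(ms):
    """Decode an integer count of milliseconds since the UNIX epoch, in UTC."""
    # fromtimestamp() expects seconds, so scale milliseconds down by 1000.
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


decoded = from_epoch_millis(STORED_VALUE)
print(decoded.isoformat())  # 2020-12-20T00:00:00+00:00
```

The local-time result shown above (2020-12-19 16:00) is the same instant viewed from a UTC-8 time zone.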
OK, so then we should use the date64() type; after all, the docs say *_Create
instance of 64-bit date (milliseconds since UNIX epoch 1970-01-01)_*:
>>> arr = pa.array([datetime(2020, 12, 20, 0, 0, 0, 123)], type=pa.date64())
>>> arr
<pyarrow.lib.Date64Array object at 0x11da877c8>
[
  2020-12-20
]
>>> table = pa.Table.from_arrays([arr], names=["start_date"])
>>> table
pyarrow.Table
start_date: date64[ms]
>>> parquet.write_table(table,
...     "/Users/hessp/ddt/rest-ingress/bebedabeep.parquet", coerce_timestamps="ms",
...     compression="SNAPPY", allow_truncated_timestamps=True)
Result for the written file:
Schema:
{quote}{
"type" : "record",
"name" : "schema",
"fields" : [ {
"name" : "start_date",
"type" : [ "null", {
"type" : "int",
"logicalType" : "date"
} ],
"default" : null
} ]
}
{quote}
Data:
||start_date|| ||
|18616| |
That is "days since UNIX epoch 1970-01-01", just like the date32() type; the
time information is stripped off. We can confirm this:
>>> arr.to_pylist()
[datetime.date(2020, 12, 20)]
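The day count can also be checked with the standard library alone. A small sketch, assuming the stored integer is a plain count of days since 1970-01-01 (the helper name from_epoch_days is hypothetical):

```python
from datetime import date, timedelta

EPOCH = date(1970, 1, 1)
MILLIS_PER_DAY = 24 * 60 * 60 * 1000

# Integer taken from the "Data" table of the second written file.
STORED_DAYS = 18616


def from_epoch_days(days):
    """Decode an integer count of days since the UNIX epoch."""
    return EPOCH + timedelta(days=days)


decoded_date = from_epoch_days(STORED_DAYS)
print(decoded_date)  # 2020-12-20

# The same midnight expressed in milliseconds; a day count cannot carry
# anything finer than this whole-day boundary.
midnight_millis = STORED_DAYS * MILLIS_PER_DAY
print(midnight_millis)  # 1608422400000
```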
How do I write a millisecond precision timestamp with pyarrow.parquet?