[
https://issues.apache.org/jira/browse/ARROW-8100?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17058315#comment-17058315
]
paul hess edited comment on ARROW-8100 at 3/13/20, 12:37 AM:
-------------------------------------------------------------
You are correct, [~wesm]: I should have used utcfromtimestamp in my example.
However, the offset difference is not the issue I am trying to present. The
issue is that the output is not 1608422400 but 1608422400000, which is not the
expected millisecond-precision timestamp but a microsecond-precision one.
Data:
||start_date|| ||
|1608422400000| |
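
As a sanity check on the unit, the reported integer can be interpreted under each candidate precision. This is a minimal standard-library sketch, not pyarrow code; the names EPOCH, raw, as_millis and as_micros are illustrative, and it assumes the value counts from the UNIX epoch in UTC.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the stored integer is an offset from the UNIX epoch in UTC.
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
raw = 1608422400000  # value shown in the Data table

# Interpret the same integer under each candidate unit.
as_millis = EPOCH + timedelta(milliseconds=raw)
as_micros = EPOCH + timedelta(microseconds=raw)

print(as_millis.isoformat())  # 2020-12-20T00:00:00+00:00
print(as_micros.isoformat())  # 1970-01-19T14:47:02.400000+00:00
```

Under these assumptions, only one of the two readings lands on the 2020-12-20 input date, which is a quick way to tell which unit a file actually holds.
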
was (Author: phess):
You are correct [~wesm] I should have used utcfromtimestamp in my example. The
offset difference is not the issue I am trying to present however, the issue is
that the output is not 1608422400 but 1608422400000
Data:
||start_date|| ||
|1608422400000| |
> [Python] timestamp[ms] and date64 data types not working as expected on write
> -----------------------------------------------------------------------------
>
> Key: ARROW-8100
> URL: https://issues.apache.org/jira/browse/ARROW-8100
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.16.0, 0.15.1
> Reporter: paul hess
> Priority: Major
>
> I expect that either timestamp[ms] or date64 will give me a millisecond
> precision datetime/timestamp as written to a parquet file. Instead, this is
> the behavior I see:
> >>> arr = pa.array([datetime(2020, 12, 20)])
> (have used pa.array([datetime(2020, 12, 20)], type=pa.timestamp('ms')) with
> no later casting as well)
> >>> arr.cast(pa.timestamp('ms'), safe=False)
> <pyarrow.lib.TimestampArray object at 0x117f3d4c8>
> [
> 2020-12-20 00:00:00.000
> ]
>
> >>> table = pa.Table.from_arrays([arr],
> ...                              names=["start_date"])
> >>> table
> pyarrow.Table
> start_date: timestamp[us]
>
> // just to make sure
>
> >>> table.column("start_date").cast(pa.timestamp('ms'), safe=False)
> <pyarrow.lib.ChunkedArray object at 0x117f5e9a8>
> [
> [
> 2020-12-20 00:00:00.000
> ]
> ]
>
> // just to make extra sure
>
> >>> schema = pa.schema([pa.field("start_date", pa.timestamp("ms"))])
> >>> table.cast(schema, safe=False)
> >>> parquet.write_table(table,
> ...                     "sldkfjasldkfj.parquet",
> ...                     coerce_timestamps="ms",
> ...                     compression="SNAPPY",
> ...                     allow_truncated_timestamps=True)
> Result for the written file:
> Schema:
> {
> "type" : "record",
> "name" : "schema",
> "fields" : [ {
> "name" : "start_date",
> "type" : [ "null",
> { "type" : "long", "logicalType" : "timestamp-millis" }
> ],
> "default" : null
> } ]
> }
> Data:
> ||start_date|| ||
> |1608422400000| |
>
> That is a microsecond [us] value, despite casting to [ms] and setting the
> appropriate option on the write_table method. If it were a millisecond
> timestamp, translating it back to a datetime with fromtimestamp would be
> accurate, but:
> >>> from datetime import datetime
> >>> datetime.fromtimestamp(1608422400000)
> Traceback (most recent call last):
> File "<stdin>", line 1, in <module>
> ValueError: year 52938 is out of range
> >>> datetime.fromtimestamp(1608422400000 / 1000)
> datetime.datetime(2020, 12, 19, 16, 0)
>
>
> OK, so then we should use the date64() type; after all, the docs say *_Create
> instance of 64-bit date (milliseconds since UNIX epoch 1970-01-01)_*
>
> >>> arr = pa.array([datetime(2020, 12, 20, 0, 0, 0, 123)], type=pa.date64())
> >>> arr
> <pyarrow.lib.Date64Array object at 0x11da877c8>
> [
> 2020-12-20
> ]
> >>> table = pa.Table.from_arrays([arr], names=["start_date"])
> >>> table
> pyarrow.Table
> start_date: date64[ms]
> >>> parquet.write_table(table,
> ...                     "bebedabeep.parquet",
> ...                     coerce_timestamps="ms",
> ...                     compression="SNAPPY",
> ...                     allow_truncated_timestamps=True)
>
>
> Result for the written file:
> Schema:
> {
> "type" : "record",
> "name" : "schema",
> "fields" : [ {
> "name" : "start_date",
> "type" : [ "null",
> { "type" : "int", "logicalType" : "date" }
> ],
> "default" : null
> } ]
> }
> Data:
>
> ||start_date|| ||
> |18616| |
>
> That is "days since UNIX epoch 1970-01-01", just like the date32() type; the
> time info is stripped off. We can confirm this:
> >>> arr.to_pylist()
> [datetime.date(2020, 12, 20)]
>
> How do I write a millisecond precision timestamp with pyarrow.parquet?
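
The two Data values reported in the issue, 18616 and 1608422400000, can be related with plain arithmetic. This is a minimal standard-library sketch rather than pyarrow code; EPOCH_DATE, MS_PER_DAY and the variable names are illustrative, and it assumes both values count from the UNIX epoch.

```python
from datetime import date, timedelta

# Assumption: both stored values are offsets from the UNIX epoch.
EPOCH_DATE = date(1970, 1, 1)
MS_PER_DAY = 24 * 60 * 60 * 1000

days = 18616  # value written for the date64 column

# Days since the epoch, converted to a calendar date.
as_date = EPOCH_DATE + timedelta(days=days)

# The same instant (midnight) expressed as milliseconds since the epoch.
as_millis = days * MS_PER_DAY

print(as_date)    # 2020-12-20
print(as_millis)  # 1608422400000
```
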