Théophile Chevalier created ARROW-7747:
------------------------------------------
Summary: [Python] coerce_timestamps + allow_truncated_timestamps
does not work as expected with nanoseconds
Key: ARROW-7747
URL: https://issues.apache.org/jira/browse/ARROW-7747
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Reporter: Théophile Chevalier
Hi,
I've encountered what looks like a bug when using:
{noformat}
pyarrow==0.15.1
pandas==0.25.3
numpy==1.18.1{noformat}
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import numpy as np
pyarrow_schema = pa.schema([pa.field("datetime_ms", pa.timestamp("ms"))])
timestamp = np.datetime64("2019-06-21T22:13:02.901123")
d = {"datetime_ms": timestamp}
df = pd.DataFrame(d, index=range(1))
table = pa.Table.from_pandas(df, schema=pyarrow_schema)
pq.write_table(
    table,
    "test.parquet",
    coerce_timestamps="ms",
    allow_truncated_timestamps=True,
)
{code}
{noformat}
pyarrow.lib.ArrowInvalid: ('Casting from timestamp[ns] to timestamp[ms] would
lose data: 1561155182901123000', 'Conversion failed for column datetime_ms with
type datetime64[ns]'){noformat}
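To make the reported loss concrete, here is a small arithmetic sketch using only the integer from the error message. It assumes nothing about Arrow internals; it just shows that the nanosecond value is not a whole number of milliseconds, which is what a safe cast would reject.

```python
# Nanosecond value copied from the error message above.
value_ns = 1561155182901123000
NS_PER_MS = 1_000_000

# Split into whole milliseconds and the sub-millisecond remainder.
value_ms, lost_ns = divmod(value_ns, NS_PER_MS)

print(value_ms)  # whole milliseconds kept by a truncating cast
print(lost_ns)   # nanoseconds that truncation would drop (non-zero here)
```

The non-zero remainder corresponds to the ".000123" microsecond part of the original timestamp.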
From my understanding, the expected behaviour should be for Arrow to allow the
conversion anyway, even if it loses some data.
Related discussions:
- https://github.com/apache/arrow/issues/1920
- https://issues.apache.org/jira/browse/ARROW-2555
This test
https://github.com/apache/arrow/blob/f70dbd1dbdb51a47e6a8a8aac8efd40ccf4d44f2/python/pyarrow/tests/test_parquet.py#L846
does not explicitly check nanosecond timestamps.
To be honest, I haven't looked at the code yet, so let me know if I missed
something. I'd be happy to fix it if it really is a bug.