Diego Argueta created ARROW-4967:
------------------------------------
Summary: Object type and stats lost when using 96-bit timestamps
Key: ARROW-4967
URL: https://issues.apache.org/jira/browse/ARROW-4967
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.12.1
Environment: PyArrow: 0.12.1
Python: 2.7.15, 3.7.2
Pandas: 0.24.2
Reporter: Diego Argueta
Run the following code:
{code:python}
import datetime as dt
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
table = pa.Table.from_pandas(dataframe, preserve_index=False)
pq.write_table(table, 'int64.parq')
pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
{code}
Examining the {{int64.parq}} file, we see that the column metadata includes an
object type ({{O:}}) of {{TIMESTAMP_MICROS}} as well as column statistics
({{ST:}}). All is well.
{code}
file schema: schema
--------------------------------------------------------------------------------
foo: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
row group 1: RC:1 TS:76 OFFSET:4
--------------------------------------------------------------------------------
foo: INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max:
2019-12-31T23:59:59.999000, num_nulls: 0]
{code}
However, in {{int96.parq}} that metadata appears to be lost: there is no
object type and there are no column stats.
{code}
file schema: schema
--------------------------------------------------------------------------------
foo: OPTIONAL INT96 R:0 D:1
row group 1: RC:1 TS:58 OFFSET:4
--------------------------------------------------------------------------------
foo: INT96 SNAPPY ... ST:[no stats for this column]
{code}
This is a bit confusing, since the metadata for the exact same data can look
different depending on whether a seemingly unrelated flag is set.