Diego Argueta created ARROW-4967:
------------------------------------

             Summary: Object type and stats lost when using 96-bit timestamps
                 Key: ARROW-4967
                 URL: https://issues.apache.org/jira/browse/ARROW-4967
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.12.1
         Environment: PyArrow: 0.12.1
                      Python: 2.7.15, 3.7.2
                      Pandas: 0.24.2
            Reporter: Diego Argueta

Run the following code:

{code:python}
import datetime as dt

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
table = pa.Table.from_pandas(dataframe, preserve_index=False)

pq.write_table(table, 'int64.parq')
pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
{code}

Examining the {{int64.parq}} file, we see that the column metadata includes an object type of {{TIMESTAMP_MICROS}} and also gives some stats. All is well.

{code}
file schema: schema
--------------------------------------------------------------------------------
foo: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1

row group 1: RC:1 TS:76 OFFSET:4
--------------------------------------------------------------------------------
foo: INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 2019-12-31T23:59:59.999000, num_nulls: 0]
{code}

However, if we look at {{int96.parq}}, that metadata appears to be lost: there is no object type and there are no column stats.

{code}
file schema: schema
--------------------------------------------------------------------------------
foo: OPTIONAL INT96 R:0 D:1

row group 1: RC:1 TS:58 OFFSET:4
--------------------------------------------------------------------------------
foo: INT96 SNAPPY ... ST:[no stats for this column]
{code}

This is a bit confusing, since the metadata for the exact same data can look different depending on whether an unrelated flag is set or cleared.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)