[ https://issues.apache.org/jira/browse/ARROW-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-4967:
-----------------------------------------
Labels: parquet (was: )
> Object type and stats lost when using 96-bit timestamps
> -------------------------------------------------------
>
> Key: ARROW-4967
> URL: https://issues.apache.org/jira/browse/ARROW-4967
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.12.1
> Environment: PyArrow: 0.12.1
> Python: 2.7.15, 3.7.2
> Pandas: 0.24.2
> Reporter: Diego Argueta
> Priority: Minor
> Labels: parquet
>
> Run the following code:
> {code:python}
> import datetime as dt
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
> table = pa.Table.from_pandas(dataframe, preserve_index=False)
> pq.write_table(table, 'int64.parq')
> pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
> {code}
> Examining the {{int64.parq}} file, we see that the column metadata includes
> an object type of {{TIMESTAMP_MICROS}} and also gives some stats. All is well.
> {code}
> file schema: schema
> --------------------------------------------------------------------------------
> foo: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1: RC:1 TS:76 OFFSET:4
> --------------------------------------------------------------------------------
> foo: INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 2019-12-31T23:59:59.999000, num_nulls: 0]
> {code}
> However, if we look at {{int96.parq}}, the metadata appears to be lost:
> there is no object type and there are no column stats.
> {code}
> file schema: schema
> --------------------------------------------------------------------------------
> foo: OPTIONAL INT96 R:0 D:1
> row group 1: RC:1 TS:58 OFFSET:4
> --------------------------------------------------------------------------------
> foo: INT96 SNAPPY ... ST:[no stats for this column]
> {code}
> This is a bit confusing, since the metadata for the exact same data can look
> different depending on whether an unrelated flag is set or cleared.