[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raphael Taylor-Davies updated ARROW-16184:
------------------------------------------
Description:
As pointed out in https://issues.apache.org/jira/browse/ARROW-2429, the
following code results in the schema changing when a parquet file is written
and then read back:
{code:python}
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}
ARROW-2429 was closed as a limitation of the parquet 1.x format, which cannot
represent nanosecond timestamps, so the data is coerced to microseconds on
write. That in itself is fine; however, the Arrow schema embedded within the
parquet metadata still lists the column as a nanosecond array. This causes
issues depending on which schema the reader opts to "trust".
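The mismatch can be observed directly from Python by comparing the file's
physical parquet schema against the Arrow schema serialized under the
{{ARROW:schema}} metadata key. The following is a minimal sketch (not part of
the original report) that assumes {{foo.parquet}} was written by the snippet
above:
{code:python}
import base64
import pyarrow as pa
import pyarrow.parquet as pq

meta = pq.read_metadata('foo.parquet')

# schema derived from the physical parquet columns
print(meta.schema.to_arrow_schema().field('created'))
# expected: timestamp[us, tz=UTC]

# Arrow schema embedded verbatim in the file's key-value metadata
ipc_schema = base64.b64decode(meta.metadata[b'ARROW:schema'])
print(pa.ipc.read_schema(pa.BufferReader(ipc_schema)).field('created'))
# expected: timestamp[ns, tz=UTC]
{code}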
This was discovered as part of the investigation into a bug report against the
arrow-rs parquet implementation: [https://github.com/apache/arrow-rs/issues/1459]

Specifically, the metadata written is:
{code}
Schema {
    endianness: Little,
    fields: Some(
        [
            Field {
                name: Some(
                    "created",
                ),
                nullable: true,
                type_type: Timestamp,
                type_: Timestamp {
                    unit: NANOSECOND,
                    timezone: Some(
                        "UTC",
                    ),
                },
                dictionary: None,
                children: Some(
                    [],
                ),
                custom_metadata: None,
            },
        ],
    ),
    custom_metadata: Some(
        [
            KeyValue {
                key: Some(
                    "pandas",
                ),
                value: Some(
                    "{\"index_columns\": [], \"column_indexes\": [], \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
                ),
            },
        ],
    ),
    features: None,
}
{code}
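For completeness, two writer-side workarounds suggest themselves (a sketch,
not from the original report; {{table}} is the table from the reproduction
above): cast the column to microseconds before writing, so the embedded
schema matches the coerced data, or write parquet format version 2.6, which
can represent nanosecond timestamps natively:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# option 1: cast to microsecond precision up front, so the embedded
# Arrow schema and the physical parquet data agree on the unit
us_schema = pa.schema([pa.field('created', pa.timestamp('us', tz='UTC'))])
pq.write_table(table.cast(us_schema), 'foo_us.parquet')

# option 2: use parquet format version 2.6, whose TIMESTAMP logical type
# supports nanosecond precision, so no coercion happens at all
pq.write_table(table, 'foo_ns.parquet', version='2.6')
print(pq.read_table('foo_ns.parquet').schema[0])
# expected: pyarrow.Field<created: timestamp[ns, tz=UTC]>
{code}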
> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -------------------------------------------------------------------------
>
> Key: ARROW-16184
> URL: https://issues.apache.org/jira/browse/ARROW-16184
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Raphael Taylor-Davies
> Priority: Minor