[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522831#comment-17522831 ]
Joris Van den Bossche commented on ARROW-16184:
-----------------------------------------------
> however, the arrow schema embedded within the parquet metadata still lists
> the data as being a nanosecond array. This causes issues depending on which
> schema the reader opts to "trust".
[~tustvold] I think you need to see the stored Arrow schema as "the schema of
the original Arrow data" that was used to write the Parquet file. In that
sense, the schema _is_ correct (the original Arrow data _did_ have nanosecond
resolution).
So that also means that this stored Arrow schema doesn't necessarily say
anything about the data as it is actually stored in the Parquet file. This is
one example where there is a difference, but there are other examples as well
(e.g. extension types, duration, fixed-size list, ... are all types that are
not directly supported in Parquet, and would thus give a difference between
the Parquet schema and the stored Arrow schema).
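To make that distinction concrete, here is a minimal sketch (assuming the
'foo.parquet' file written by the reproduction below) that prints both the
physical Parquet schema and the Arrow schema that pyarrow stores
base64-encoded under the "ARROW:schema" key of the file metadata:
{code:python}
import base64

import pyarrow as pa
import pyarrow.parquet as pq

meta = pq.read_metadata('foo.parquet')

# The physical Parquet schema: the timestamp column is stored with
# microsecond precision, since Parquet 1.x has no nanosecond logical type.
print(meta.schema)

# The original Arrow schema is serialized and kept base64-encoded under
# the "ARROW:schema" key of the key-value metadata; it still describes the
# data as it was written, i.e. timestamp[ns, tz=UTC].
stored = base64.b64decode(meta.metadata[b'ARROW:schema'])
print(pa.ipc.read_schema(pa.py_buffer(stored)))
{code}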
> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -------------------------------------------------------------------------
>
> Key: ARROW-16184
> URL: https://issues.apache.org/jira/browse/ARROW-16184
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Reporter: Raphael Taylor-Davies
> Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429, the
> following code results in the schema changing when writing and then reading
> back a Parquet file.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
> print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
> {code}
> This was closed as a limitation of the Parquet 1.x format, which cannot
> represent nanosecond timestamps. That is fine; however, the Arrow schema
> embedded within the Parquet metadata still lists the data as a nanosecond
> array. This causes issues depending on which schema the reader opts to
> "trust".
> This was discovered as part of the investigation into a bug report on the
> arrow-rs parquet implementation -
> [https://github.com/apache/arrow-rs/issues/1459]
> Specifically, the metadata written is:
> {code}
> Schema {
>     endianness: Little,
>     fields: Some(
>         [
>             Field {
>                 name: Some(
>                     "created",
>                 ),
>                 nullable: true,
>                 type_type: Timestamp,
>                 type_: Timestamp {
>                     unit: NANOSECOND,
>                     timezone: Some(
>                         "UTC",
>                     ),
>                 },
>                 dictionary: None,
>                 children: Some(
>                     [],
>                 ),
>                 custom_metadata: None,
>             },
>         ],
>     ),
>     custom_metadata: Some(
>         [
>             KeyValue {
>                 key: Some(
>                     "pandas",
>                 ),
>                 value: Some(
>                     "{\"index_columns\": [], \"column_indexes\": [], \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
>                 ),
>             },
>         ],
>     ),
>     features: None,
> }
> {code}
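For comparison, a minimal sketch (reusing the table from the reproduction
above): writing with Parquet format version "2.6", which does have a
nanosecond timestamp logical type, stores the column at its original
resolution, so the physical Parquet schema and the stored Arrow schema agree:
{code:python}
import pyarrow.parquet as pq

# Parquet format version "2.6" has a nanosecond timestamp logical type,
# so the column is stored at its original resolution and the physical
# schema matches the embedded Arrow schema.
pq.write_table(table, 'foo_ns.parquet', version='2.6')
print(pq.read_table('foo_ns.parquet').schema[0])
# expected: pyarrow.Field<created: timestamp[ns, tz=UTC]>
{code}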