[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522835#comment-17522835 ]

Joris Van den Bossche edited comment on ARROW-16184 at 4/15/22 1:49 PM:
------------------------------------------------------------------------

There is a "Roundtripping Arrow types" section in the Arrow parquet docs: 
https://arrow.apache.org/docs/dev/cpp/parquet.html#roundtripping-arrow-types  
(we should probably update that with an example for timestamp as well, instead 
of only the LargeList example, to make this clearer)

> That being said it seems odd to me that the less expressive schema would be 
> treated as the authoritative one - if you can't trust the arrow schema, what 
> is the point in embedding it?

Note that it _can_ be trusted, but only for what it is meant for: a 
description of the original Arrow schema, and _not_ a description of what 
is in the Parquet file / the Parquet schema. 
When reading the actual Parquet data and restoring information from the Arrow 
schema, you still need to do a proper conversion of the Parquet data to a 
potentially different Arrow type. It is up to the reader implementation to 
decide to what extent it restores information from the stored Arrow schema 
(and to do so correctly).




> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -------------------------------------------------------------------------
>
>                 Key: ARROW-16184
>                 URL: https://issues.apache.org/jira/browse/ARROW-16184
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Raphael Taylor-Davies
>            Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429 the 
> following code results in the schema changing when reading/writing a parquet 
> file.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
> print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
> {code}
> This was closed as a limitation of the parquet 1.x format for representing 
> nanosecond timestamps. That is fine; however, the Arrow schema embedded 
> within the Parquet metadata still lists the data as being a nanosecond array, 
> which causes issues depending on which schema the reader opts to "trust".
> This was discovered as part of the investigation into a bug report on the 
> arrow-rs parquet implementation - 
> [https://github.com/apache/arrow-rs/issues/1459]
> Specifically the metadata written is
> {code:java}
> Schema {
>     endianness: Little,
>     fields: Some(
>         [
>             Field {
>                 name: Some(
>                     "created",
>                 ),
>                 nullable: true,
>                 type_type: Timestamp,
>                 type_: Timestamp {
>                     unit: NANOSECOND,
>                     timezone: Some(
>                         "UTC",
>                     ),
>                 },
>                 dictionary: None,
>                 children: Some(
>                     [],
>                 ),
>                 custom_metadata: None,
>             },
>         ],
>     ),
>     custom_metadata: Some(
>         [
>             KeyValue {
>                 key: Some(
>                     "pandas",
>                 ),
>                 value: Some(
>                     "{\"index_columns\": [], \"column_indexes\": [], 
> \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", 
> \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", 
> \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": 
> \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
>                 ),
>             },
>         ],
>     ),
>     features: None,
> } {code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)