[ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Raphael Taylor-Davies updated ARROW-16184:
------------------------------------------
Description:
As pointed out in https://issues.apache.org/jira/browse/ARROW-2429, the
following code results in the schema changing when a parquet file is written
and then read back:
{code:python}
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}
ARROW-2429 was closed as a limitation of the parquet 1.x format, which cannot
represent nanosecond timestamps, so the data is coerced to microseconds on
write. That in itself is fine; however, the Arrow schema embedded within the
parquet metadata still lists the column as a nanosecond array. This causes
issues depending on which schema the reader opts to "trust".
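The mismatch can be observed directly from Python by comparing the file's
physical parquet schema against the Arrow schema serialized under the
{{ARROW:schema}} metadata key. The following is a minimal sketch (not part of
the original report) that assumes {{foo.parquet}} was written by the snippet
above:
{code:python}
import base64
import pyarrow as pa
import pyarrow.parquet as pq

meta = pq.read_metadata('foo.parquet')

# schema derived from the physical parquet columns
print(meta.schema.to_arrow_schema().field('created'))
# expected: timestamp[us, tz=UTC]

# Arrow schema embedded verbatim in the file's key-value metadata
ipc_schema = base64.b64decode(meta.metadata[b'ARROW:schema'])
print(pa.ipc.read_schema(pa.BufferReader(ipc_schema)).field('created'))
# expected: timestamp[ns, tz=UTC]
{code}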
This was discovered as part of the investigation into a bug report against the
arrow-rs parquet implementation: [https://github.com/apache/arrow-rs/issues/1459]

Specifically, the metadata written is:
{code}
Schema {
    endianness: Little,
    fields: Some(
        [
            Field {
                name: Some(
                    "created",
                ),
                nullable: true,
                type_type: Timestamp,
                type_: Timestamp {
                    unit: NANOSECOND,
                    timezone: Some(
                        "UTC",
                    ),
                },
                dictionary: None,
                children: Some(
                    [],
                ),
                custom_metadata: None,
            },
        ],
    ),
    custom_metadata: Some(
        [
            KeyValue {
                key: Some(
                    "pandas",
                ),
                value: Some(
                    "{\"index_columns\": [], \"column_indexes\": [], \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
                ),
            },
        ],
    ),
    features: None,
}
{code}
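For completeness, two writer-side workarounds suggest themselves (a sketch,
not from the original report; {{table}} is the table from the reproduction
above): cast the column to microseconds before writing, so the embedded
schema matches the coerced data, or write parquet format version 2.6, which
can represent nanosecond timestamps natively:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq

# option 1: cast to microsecond precision up front, so the embedded
# Arrow schema and the physical parquet data agree on the unit
us_schema = pa.schema([pa.field('created', pa.timestamp('us', tz='UTC'))])
pq.write_table(table.cast(us_schema), 'foo_us.parquet')

# option 2: use parquet format version 2.6, whose TIMESTAMP logical type
# supports nanosecond precision, so no coercion happens at all
pq.write_table(table, 'foo_ns.parquet', version='2.6')
print(pq.read_table('foo_ns.parquet').schema[0])
# expected: pyarrow.Field<created: timestamp[ns, tz=UTC]>
{code}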
> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -------------------------------------------------------------------------
>
> Key: ARROW-16184
> URL: https://issues.apache.org/jira/browse/ARROW-16184
> Project: Apache Arrow
> Issue Type: Bug
> Reporter: Raphael Taylor-Davies
> Priority: Minor