[ 
https://issues.apache.org/jira/browse/ARROW-2429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dave Challis updated ARROW-2429:
--------------------------------
    Description: 
When creating an Arrow table from a Pandas DataFrame, the table schema contains 
a field of type `timestamp[ns]`.

When serialising that table to a Parquet file and then immediately reading it 
back, the schema of the resulting table instead contains a field of type 
`timestamp[us]`.

Minimal example:
 
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}

  was:
When creating an Arrow table from a Pandas DataFrame, the table schema contains 
a field of type `timestamp[ns]`.

When serialising that table to a Parquet file and then immediately reading it 
back, the schema of the resulting table instead contains a field of type 
`timestamp[us]`.

 
{code:python}
#!/usr/bin/env python

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
{code}


> [Python] Timestamp unit in schema changes when writing to Parquet file then 
> reading back
> ----------------------------------------------------------------------------------------
>
>                 Key: ARROW-2429
>                 URL: https://issues.apache.org/jira/browse/ARROW-2429
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.9.0
>         Environment: Mac OS High Sierra
> PyArrow 0.9.0 (py36_1)
> Python
>            Reporter: Dave Challis
>            Priority: Minor
>
> When creating an Arrow table from a Pandas DataFrame, the table schema 
> contains a field of type `timestamp[ns]`.
> When serialising that table to a Parquet file and then immediately reading it 
> back, the schema of the resulting table instead contains a field of type 
> `timestamp[us]`.
> Minimal example:
>  
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])   # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
> print(table2.schema[0])  # pyarrow.Field<created: timestamp[us]> (microsecond units)
> {code}
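
For anyone gauging the impact of the unit change described above, here is a 
rough illustration in plain Python (no pyarrow involved). It assumes only that a 
timestamp is stored as an integer count of units since the Unix epoch. The 
helper names `ns_to_us` and `survives_us_round_trip`, and the `finer_ns` value, 
are hypothetical and exist only for this sketch; it does not show what the 
Parquet writer actually does internally.

```python
from datetime import datetime, timezone

NS_PER_US = 1_000  # nanoseconds per microsecond


def ns_to_us(ns: int) -> int:
    """Convert a nanosecond count to microseconds, dropping any remainder."""
    return ns // NS_PER_US


def survives_us_round_trip(ns: int) -> bool:
    """Return True if ns -> us -> ns gives back the original value."""
    return ns_to_us(ns) * NS_PER_US == ns


# The timestamp from the minimal example, as whole seconds since the epoch.
epoch_seconds = int(
    datetime(2018, 4, 4, 10, 14, 14, tzinfo=timezone.utc).timestamp()
)
created_ns = epoch_seconds * 1_000_000_000

# A hypothetical value with sub-microsecond detail, for comparison only.
finer_ns = created_ns + 123

print(survives_us_round_trip(created_ns))  # whole seconds: nothing to lose
print(survives_us_round_trip(finer_ns))    # the extra 123 ns cannot be kept
```

The value in the report has no sub-microsecond component, so only the schema 
unit differs after the round trip; data with finer detail could not be 
represented in microseconds without loss.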



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
