Andy Douglas created ARROW-11388:
------------------------------------
Summary: Dataset Timezone Handling
Key: ARROW-11388
URL: https://issues.apache.org/jira/browse/ARROW-11388
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 3.0.0, 2.0.0
Reporter: Andy Douglas
I'm trying to write a pandas DataFrame with a timezone-aware DatetimeIndex
to a pyarrow dataset, but the timezone information doesn't seem to be written
(apart from in the pandas metadata).
For example:
{code:java}
import os
import pandas as pd
import numpy as np
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
from pathlib import Path
print(pa.__version__)
# create dummy dataframe with a datetime index containing tz info
# ("1min" is the same one-minute frequency as the older "1T" alias)
df = pd.DataFrame(
    dict(
        timestamp=pd.date_range("2021-01-01", freq="1min", periods=100,
                                tz="US/Eastern"),
        x=np.arange(100),
    )
).set_index("timestamp")
test_dir = Path("test_dir")
table = pa.Table.from_pandas(df)
schema = table.schema
print(schema)
print(schema.pandas_metadata)
pq.write_to_dataset(table, test_dir)
print(pq.ParquetFile(test_dir / os.listdir(test_dir)[0]).read())
dataset = ds.dataset(test_dir, format="parquet", schema=schema)
print(dataset.to_table())
{code}
Is this a bug or am I missing something?
Thanks
Andy