[ https://issues.apache.org/jira/browse/ARROW-11388?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17272727#comment-17272727 ]
Joris Van den Bossche commented on ARROW-11388:
-----------------------------------------------
bq. Do you have any suggestions for how I would handle the above? Would you
suggest doing the schema checking within the library and not passing the schema
parameter on pyarrow dataset read/write calls?
Specifically for reading, I would indeed not pass the schema to the dataset
read call (as your report above shows, that raises an error if the schema does
not match exactly). We certainly want to make this work in the next release,
but for now I would advise reading the dataset as is first and then, if its
schema doesn't match the known schema, casting the resulting table to that
schema (so {{ds.dataset(...).to_table().cast(schema)}} instead of
{{ds.dataset(..., schema=schema).to_table()}}). In the end, when we add
support, it will basically also be a cast under the hood.
For writing, you can ensure the correct schema on conversion from pandas to
pyarrow; that should already work fine, I think?
bq. Separately, I also see an issue on write for indexed pandas dataframes
where the index column is duplicated in the pandas metadata without the
timezone information being added. I'll raise a separate issue for this.
Yes, please do (I recall an issue about duplicated columns for the index, so
this aspect might already be solved in pyarrow 3.0).
> [Python] Dataset Timezone Handling
> ----------------------------------
>
> Key: ARROW-11388
> URL: https://issues.apache.org/jira/browse/ARROW-11388
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 2.0.0, 3.0.0
> Reporter: Andy Douglas
> Priority: Minor
>
> I'm trying to write a pandas DataFrame with a timezone-aware DatetimeIndex
> to a pyarrow dataset, but the timezone information doesn't seem to be
> written (apart from in the pandas metadata).
>
> For example
>
> {code:python}
> import os
> import pandas as pd
> import numpy as np
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pyarrow.dataset as ds
> from pathlib import Path
> # I've tried with both v2.0 and v3.0 today
> print(pa.__version__)
> # create dummy dataframe with datetime index containing tz info
> df = pd.DataFrame(
>     dict(
>         timestamp=pd.date_range("2021-01-01", freq="1T", periods=100,
>                                 tz="US/Eastern"),
>         x=np.arange(100),
>     )
> ).set_index("timestamp")
> test_dir = Path("test_dir")
> table = pa.Table.from_pandas(df)
> schema = table.schema
> print(schema)
> print(schema.pandas_metadata)
> # warning - creates dir in cwd
> pq.write_to_dataset(table, test_dir)
> # timestamp column is us and UTC
> print(pq.ParquetFile(test_dir / os.listdir(test_dir)[0]).read())
> # create dataset using schema from earlier
> dataset = ds.dataset(test_dir, format="parquet", schema=schema)
> # doesn't work
> dataset.to_table()
> {code}
>
>
> Is this a bug or am I missing something?
> Thanks
> Andy
>