[jira] [Commented] (ARROW-13756) [Python] Error in pandas conversion for datetimetz column index

Joris Van den Bossche (Jira) Mon, 30 Aug 2021 04:10:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-13756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17406682#comment-17406682
 ]


Joris Van den Bossche commented on ARROW-13756:
-----------------------------------------------

A workaround you can use for now is to convert your column names to strings 
before converting to arrow / writing to parquet, and then afterwards convert 
back to datetimes manually:

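{code:python}
import pandas as pd
import pyarrow as pa

# convert the datetime column labels to strings
df.columns = df.columns.astype('str')
# roundtrip to arrow or parquet
table = pa.table(df)
result = table.to_pandas()
# convert string column names back to a tz-aware DatetimeIndex;
# utc=True keeps parsing robust if the labels span mixed UTC offsets (DST)
result.columns = pd.to_datetime(result.columns, utc=True).tz_convert("CET")
{code}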

In general, pyarrow tries to recreate the original pandas column labels when 
converting back to pandas, but note that Arrow and Parquet both only support 
string column names (so the stored data uses the stringified names in any 
case).

> [Python] Error in pandas conversion for datetimetz column index
> ---------------------------------------------------------------
>
>                 Key: ARROW-13756
>                 URL: https://issues.apache.org/jira/browse/ARROW-13756
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 5.0.0
>         Environment: Ubuntu 21.04
>            Reporter: Andreas Wolf
>            Priority: Major
>              Labels: pandas
>
> The following code fails with:
> {code:java}
> File "[...]/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1052, 
> in _pandas_type_to_numpy_type
>  return np.dtype(pandas_type)
> TypeError: data type 'datetimetz' not understood{code}
> Sample:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> def run():
>     filename = "test.parquet"
>     df = pd.DataFrame(
>         data=range(31),
>         columns=list("A"),
>         index=pd.date_range("2021-01-01", "2021-01-31", freq="D", tz="CET"),
>     ).T
>     table = pa.Table.from_pandas(df)
>     pq.write_to_dataset(table, root_path=filename)
>     result = pq.read_table(filename).to_pandas()
>     return result
> if __name__ == "__main__":
>     run()
> {code}
> The code tries to store a dataframe where the columns are timezone-aware 
> datetimes.
> _Observations_:
> If I remove the *.T* at the end of the dataframe, so that the datetime index 
> forms the rows, it works (but that is not what I want).
> If I remove the timezone information *tz="CET"*, the code works.
> I assume this bug is related to [Error in pandas conversion for datetimetz 
> row index|https://issues.apache.org/jira/browse/ARROW-1958]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
