[
https://issues.apache.org/jira/browse/ARROW-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343124#comment-17343124
]
Joris Van den Bossche commented on ARROW-12732:
-----------------------------------------------
Looking into this in more detail, it seems that just adding an
{{import pandas}} is indeed not enough to trigger the registration of the
extension types. Registration only happens the first time a pandas<->arrow
conversion is actually used. We did have a discussion about this when
initially adding those extension types in pandas:
https://github.com/pandas-dev/pandas/pull/28371#issuecomment-549177206
I first tried to always import pyarrow to register the types, to ensure an
{{import pandas}} would be sufficient. But, at the time, that caused a
circular import problem with older pyarrow versions (as those always
imported pandas), and so I switched to a lazy import/registration
(https://github.com/pandas-dev/pandas/pull/28371#issuecomment-572032856). But
now that pandas requires a more recent version of pyarrow, we can probably
change that.
[~jaidisido] the "workaround" for now is to do
{{import pandas.core.arrays._arrow_utils}} if you want to be able to trigger
the registration of the extension types with an import. A safer way (the
previous import is private and can change at any time) might be to add a line
like
{{pa.table(pd.DataFrame({'a': pd.period_range("2012", freq="Y", periods=3)}))}}
after your imports. That's a cheap line and will always trigger the
registration.
> [Python] read parquet in pyarrow is not idempotent for time period types
> ------------------------------------------------------------------------
>
> Key: ARROW-12732
> URL: https://issues.apache.org/jira/browse/ARROW-12732
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 3.0.0, 4.0.0
> Reporter: Abderrahmane Jaidi
> Assignee: Joris Van den Bossche
> Priority: Major
> Attachments: period.parquet
>
>
> When reading a parquet file (attached) with a period type column via the
> "read_table" method, it returns "int64" on the first read. After applying
> "to_pandas" to the pyarrow table, subsequent "read_table" calls of the same
> parquet file in the same *Python session* return "ArrowPeriodType"
> {code:java}
> import pyarrow
> import pyarrow.parquet
> pq_table = pyarrow.parquet.read_table("s3://my-bucket/my-prefix/period.parquet")
> print(pq_table.schema.types)
> # Out[1]: [DataType(int64)]
> print(pq_table.to_pandas())
> # Out[2]:
> # col
> # 0 2010-01
> pq_table = pyarrow.parquet.read_table("s3://my-bucket/my-prefix/period.parquet")
> print(pq_table.schema.types)
> # Out[3]: [ArrowPeriodType(DataType(int64))]
> pq_table = pyarrow.parquet.read_table("s3://my-bucket/my-prefix/period.parquet")
> print(pq_table.schema.types)
> # Out[4]: [ArrowPeriodType(DataType(int64))]{code}
>