[jira] [Commented] (ARROW-12732) [Python] read parquet in pyarrow is not idempotent for time period types

Joris Van den Bossche (Jira) Wed, 12 May 2021 01:46:04 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-12732?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17343124#comment-17343124
 ]


Joris Van den Bossche commented on ARROW-12732:
-----------------------------------------------

Looking into this in more detail, it seems that just adding an {{import pandas}} 
is indeed not enough to trigger the registration of the extension types. That 
only happens the first time a pandas<->arrow conversion is actually used. We did 
have a discussion about this when initially adding those extension types in 
pandas: 
https://github.com/pandas-dev/pandas/pull/28371#issuecomment-549177206

I first tried to always import pyarrow to register the types, so that an 
{{import pandas}} would be sufficient. But at the time that caused a circular 
import problem with older pyarrow versions (as those always imported pandas), 
so I switched to a lazy import/registration 
(https://github.com/pandas-dev/pandas/pull/28371#issuecomment-572032856). Now 
that pandas requires a more recent version of pyarrow, we can probably 
change that. 
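
To illustrate why the first read falls back to the storage type, here is a toy 
model of lazy registration. It uses only the standard library, every name in it 
is hypothetical, and it is not the actual pandas or pyarrow code:

```python
# Toy model of lazy extension-type registration.
# All names are hypothetical; this is not pandas or pyarrow internals.
_REGISTRY = {}  # extension name -> type label


def _register_period_type():
    """Register the (pretend) period extension type once."""
    _REGISTRY.setdefault("pandas.period", "ArrowPeriodType")


def read_type(extension_name, storage_type="int64"):
    """Resolve a stored column type, falling back to the storage type
    when no extension type is registered under that name."""
    return _REGISTRY.get(extension_name, storage_type)


def to_pandas():
    """Stand-in for the first pandas<->arrow conversion, which is what
    triggers the registration as a side effect."""
    _register_period_type()


first = read_type("pandas.period")   # nothing registered yet
to_pandas()                          # registration happens here
second = read_type("pandas.period")  # now resolved to the extension type
print(first, second)
```

With an eager registration at import time, both reads would resolve to the 
extension type.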

[~jaidisido] the "workaround" for now is to do {{import 
pandas.core.arrays._arrow_utils}} if you want to trigger the registration of 
the extension types with an import. A safer way (that import is private 
and can change at any time) might be to add a line like 
{{pa.table(pd.DataFrame({'a': pd.period_range("2012", freq="Y", periods=3)}))}} 
after your imports. That's a cheap line and will always trigger the 
registration.



> [Python] read parquet in pyarrow is not idempotent for time period types
> ------------------------------------------------------------------------
>
>                 Key: ARROW-12732
>                 URL: https://issues.apache.org/jira/browse/ARROW-12732
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 3.0.0, 4.0.0
>            Reporter: Abderrahmane Jaidi
>            Assignee: Joris Van den Bossche
>            Priority: Major
>         Attachments: period.parquet
>
>
> When reading a parquet file (attached) with a period type column via the 
> "read_table" method, it returns "int64" on the first read. After applying 
> "to_pandas" to the pyarrow table, subsequent "read_table" calls of the same 
> parquet file in the same *Python session* return "ArrowPeriodType":
> {code:java}
> import pyarrow
> import pyarrow.parquet
> pq_table = pyarrow.parquet.read_table("s3://my-bucket/my-prefix/period.parquet")
> print(pq_table.schema.types)
> # Out[1]: [DataType(int64)]
> print(pq_table.to_pandas())
> # Out[2]:
> # col
> # 0 2010-01
> pq_table = pyarrow.parquet.read_table("s3://my-bucket/my-prefix/period.parquet")
> print(pq_table.schema.types)
> # Out[3]: [ArrowPeriodType(DataType(int64))]
> pq_table = pyarrow.parquet.read_table("s3://my-bucket/my-prefix/period.parquet")
> print(pq_table.schema.types)
> # Out[4]: [ArrowPeriodType(DataType(int64))]{code}
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
