[jira] [Commented] (ARROW-10133) [Python] parquet Int64 col cast to float64 on load in pandas

Joris Van den Bossche (Jira) Wed, 30 Sep 2020 02:37:00 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-10133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17204605#comment-17204605
 ]


Joris Van den Bossche commented on ARROW-10133:
-----------------------------------------------

[~skyetetra] by default, pandas does not currently support missing values in 
integer columns. So once an integer column contains missing values, it is 
cast to float. See 
https://pandas.pydata.org/docs/user_guide/missing_data.html for more details.
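
As a small illustration of that default (a sketch with made-up values, not 
taken from the attached file), the same integers end up with different dtypes 
depending on whether a missing value is present and whether the nullable 
"Int64" dtype is requested explicitly:

```python
import pandas as pd

# Integers without missing values keep an integer dtype.
complete = pd.Series([1, 2, 3])

# With a missing value, the default NumPy-backed dtype cannot hold NA,
# so pandas falls back to float64 and stores NaN.
with_missing = pd.Series([1, None, 3])

# Requesting the nullable extension dtype keeps the integers and stores <NA>.
nullable = pd.Series([1, None, 3], dtype="Int64")

print(complete.dtype, with_missing.dtype, nullable.dtype)
```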

So what you describe is expected behaviour. There is work under way to actually 
support integers with missing values in pandas: the dtype itself already 
exists, and in the upcoming pandas 1.2 release there will also be an option to 
use those dtypes when reading parquet files 
(https://github.com/pandas-dev/pandas/pull/31242).

> [Python] parquet Int64 col cast to float64 on load in pandas
> ------------------------------------------------------------
>
>                 Key: ARROW-10133
>                 URL: https://issues.apache.org/jira/browse/ARROW-10133
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.17.1
>            Reporter: Jacqueline Nolis
>            Priority: Minor
>         Attachments: example-failed-int64.parquet
>
>
> Under certain conditions a saved parquet table with a column that is Int64 
> and all NA seems to be cast to a float64 with all NaN on load. The desired 
> behavior is to have it stay as Int64. Attached is a table where said issue 
> occurs: the second column here should be an int64 but is being loaded as a 
> float64 in pandas.
>  
> Interestingly, it seems to be correctly interpreting the column as an Int64 
> when loading in R, so perhaps it's only a pandas issue.
>  
> import pyarrow.parquet as pq
> import boto3
> import pandas as pd
> import io
> # file attached to ticket
> obj = boto3.client('s3').get_object(Bucket="...", Key='...')
> x = pq.read_table(io.BytesIO(obj['Body'].read()))
> y = x.to_pandas() # this is where the undesired int64 to a float64 cast occurs
> # >>> x
> # pyarrow.Table
> # product_id: string
> # cost: int64
> # name: string
> # >>> y.dtypes
> # product_id object
> # cost float64
> # name object
> # dtype: object



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
