[jira] [Commented] (ARROW-14004) to_pandas() converts to float instead of using pandas nullable types

Joris Van den Bossche (Jira) Wed, 15 Sep 2021 08:35:11 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415597#comment-17415597
 ]


Joris Van den Bossche commented on ARROW-14004:
-----------------------------------------------

> If the column was created with pandas first it is correctly preserved (I 
> guess it's using stored metadata for this).

That's correct.

> As currently there is support for nullable types in pandas, just as in Arrow, 
> it would be great to use these types when dealing with columns with null 
> values.

Since pandas itself does not yet use those nullable dtypes by default (and 
quite a few parts of pandas still don't fully support them), I think pyarrow 
should not use them by default yet either.

> If you are reticent to change this behavior, a param would be nice too (e.g. 
> `to_pandas(use_nullable_types: True)`).

There is actually already a keyword to customize the dtype used in the 
conversion to pandas, to support extension dtypes in general: {{types_mapper}}.

And this can be used to get the effect you want (only for int64):

{code:python}
table.to_pandas(types_mapper={pa.int64(): pd.Int64Dtype()}.get)
{code}

Now, if you want this for all integer dtypes (signed and unsigned, all bit 
widths), this of course gets a bit unwieldy, but you can define the 
dictionary once somewhere in your code and then reuse it.

For such a case, adding a {{use_nullable_dtypes}} keyword would be a nice 
shortcut (e.g. pandas' {{read_parquet}} already has this, and uses 
{{types_mapper}} under the hood). However, I am a bit hesitant to add such a 
keyword, as people might expect different behaviour from it. For example, 
pandas also has nullable float and string dtypes, and depending on your use 
case you might want nullable ints but not nullable floats (for floats there is 
less benefit in using them). But a general keyword should probably enable all 
of them.

> to_pandas() converts to float instead of using pandas nullable types
> --------------------------------------------------------------------
>
>                 Key: ARROW-14004
>                 URL: https://issues.apache.org/jira/browse/ARROW-14004
>             Project: Apache Arrow
>          Issue Type: Bug
>            Reporter: Miguel Cantón Cortés
>            Priority: Major
>         Attachments: image.png
>
>
> We've noticed that when converting an Arrow Table to pandas using 
> `.to_pandas()` integer columns with null values get converted to float 
> instead of using pandas nullable types.
> If the column was created with pandas first it is correctly preserved (I 
> guess it's using stored metadata for this).
> I've attached a screenshot showing this behavior.
> As currently there is support for nullable types in pandas, just as in Arrow, 
> it would be great to use these types when dealing with columns with null 
> values.
> If you are reticent to change this behavior, a param would be nice too (e.g. 
> `to_pandas(use_nullable_types: True)`).
>  



