[ 
https://issues.apache.org/jira/browse/ARROW-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415680#comment-17415680
 ] 

Joris Van den Bossche commented on ARROW-14004:
-----------------------------------------------

That's another reason that I am not fully in favor of such a flag. Personally,
_I_ would expect that it reliably results in certain dtypes given the schema of
the Arrow table, regardless of the actual values (whether nulls are present or
not).
(Of course that is currently also not the case, since you get int or float
depending on it, but that's due to a limitation of numpy.)

Reasons for this: 1) type predictability/stability (the result dtype depending
only on the dtypes of the input, not on the actual values) can be very useful,
and 2) the numpy int64 and nullable Int64 dtypes in pandas _do_ behave
differently for certain operations (e.g. propagation of NA in comparisons), so
I think it should be a conscious choice. For example, missing values could be
introduced in a later step in the pipeline after the arrow->pandas conversion
(e.g. due to a join), and then having the nullable Int64 dtype, even though
there were no missing values initially, will give a different result.
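
As a small sketch of the behavioural difference in point 2 (the sample values
are invented, and only pandas is needed here): comparing a float column holding
NaN gives a plain bool result, while the nullable Int64 column propagates NA.

```python
import numpy as np
import pandas as pd

# What an int column with a null looks like after the default conversion (float + NaN) ...
as_float = pd.Series([1.0, np.nan])
# ... versus the same data held in the pandas nullable integer dtype.
as_nullable = pd.Series([1, None], dtype="Int64")

float_cmp = as_float == 1        # NaN compares as False, result dtype is bool
nullable_cmp = as_nullable == 1  # NA propagates, result dtype is nullable boolean

print(float_cmp.tolist())     # [True, False]
print(nullable_cmp.tolist())  # [True, <NA>]
```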

So I think it is difficult to have a single flag that caters to the different
expectations. On the other hand, it's certainly true that, if not done under
the hood, it's difficult/impossible to get the "nullable dtype only if there
are nulls" behaviour without converting the floats back to ints (although
converting the nullable int columns to normal int columns after the fact should
be an easy/cheap conversion).


> [Python] to_pandas() converts to float instead of using pandas nullable types
> -----------------------------------------------------------------------------
>
>                 Key: ARROW-14004
>                 URL: https://issues.apache.org/jira/browse/ARROW-14004
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Documentation, Python
>            Reporter: Miguel Cantón Cortés
>            Priority: Major
>              Labels: pandas
>             Fix For: 6.0.0
>
>         Attachments: image.png
>
>
> We've noticed that when converting an Arrow Table to pandas using 
> `.to_pandas()` integer columns with null values get converted to float 
> instead of using pandas nullable types.
> If the column was created with pandas first it is correctly preserved (I 
> guess it's using stored metadata for this).
> I've attached a screenshot showing this behavior.
> As currently there is support for nullable types in pandas, just as in Arrow, 
> it would be great to use these types when dealing with columns with null 
> values.
> If you are reluctant to change this behavior, a param would be nice too (e.g. 
> `to_pandas(use_nullable_types: True)`).
>  



