[
https://issues.apache.org/jira/browse/ARROW-14004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17415680#comment-17415680
]
Joris Van den Bossche commented on ARROW-14004:
-----------------------------------------------
That's another reason that I am not fully in favor of such a flag. Personally,
_I_ would expect that it reliably results in certain dtypes given the schema of
the Arrow table, regardless of the actual values (whether nulls are present or
not).
(Of course, that is currently also not the case, since you get int or float
depending on whether nulls are present, but that's due to a limitation of
numpy.)
Reasons for this: 1) type predictability/stability (the result dtype depending
only on the dtypes of the input, not on the actual values) can be very useful,
and 2) the numpy int64 vs nullable Int64 dtypes in pandas _do_ behave
differently for certain operations (e.g. propagation of NA in comparisons), so
I think it should be a conscious choice. For example, missing values could be
introduced in a later step in the pipeline after the arrow->pandas conversion
(e.g. due to a join), and then having the nullable Int64, even though there
were no missing values initially, will give a different result.
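To illustrate the second point, here is a small pandas-only sketch. It uses
reindex() as a stand-in for a later pipeline step (such as a join) that
introduces a missing value; the variable names are hypothetical.

```python
import pandas as pd

numpy_ints = pd.Series([1, 2, 3], dtype="int64")
nullable_ints = pd.Series([1, 2, 3], dtype="Int64")

# Stand-in for a later step (e.g. a join) that adds a row with no value.
numpy_after = numpy_ints.reindex([0, 1, 2, 3])        # upcast to float64, NaN
nullable_after = nullable_ints.reindex([0, 1, 2, 3])  # stays Int64, pd.NA

# The same comparison now behaves differently for the missing row.
numpy_cmp = numpy_after > 1        # bool dtype: NaN compares as False
nullable_cmp = nullable_after > 1  # boolean dtype: pd.NA propagates

print(numpy_cmp.tolist())
print(nullable_cmp.tolist())
```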
So I think it is difficult to have a single flag that caters to the different
expectations. On the other hand, it's certainly true that, if not done under
the hood, it's difficult/impossible to get the "nullable dtype only if there
are nulls" behaviour without converting the floats back to ints, although
converting the nullable int columns to normal int columns after the fact
should be an easy/cheap conversion.
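A rough sketch of that after-the-fact conversion, using pandas only. The
helper name downcast_nullable_ints is hypothetical (it is not part of pandas
or pyarrow), and the sketch assumes the columns use pandas' masked integer
dtypes such as Int64.

```python
import pandas as pd
from pandas.api.types import is_extension_array_dtype, is_integer_dtype


def downcast_nullable_ints(df):
    """Hypothetical helper: convert nullable integer columns that contain
    no missing values to the corresponding numpy integer dtype."""
    out = df.copy()
    for name in out.columns:
        col = out[name]
        is_nullable_int = (
            is_extension_array_dtype(col.dtype) and is_integer_dtype(col.dtype)
        )
        if is_nullable_int and not col.isna().any():
            # e.g. Int64 -> int64; columns with NA keep their nullable dtype
            out[name] = col.astype(col.dtype.numpy_dtype)
    return out


df = pd.DataFrame(
    {
        "no_nulls": pd.array([1, 2, 3], dtype="Int64"),
        "has_nulls": pd.array([1, None, 3], dtype="Int64"),
    }
)
result = downcast_nullable_ints(df)
print(result.dtypes)
```

This gives the "nullable dtype only if there are nulls" result as an explicit,
opt-in step on the pandas side.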
> [Python] to_pandas() converts to float instead of using pandas nullable types
> -----------------------------------------------------------------------------
>
> Key: ARROW-14004
> URL: https://issues.apache.org/jira/browse/ARROW-14004
> Project: Apache Arrow
> Issue Type: Bug
> Components: Documentation, Python
> Reporter: Miguel Cantón Cortés
> Priority: Major
> Labels: pandas
> Fix For: 6.0.0
>
> Attachments: image.png
>
>
> We've noticed that when converting an Arrow Table to pandas using
> `.to_pandas()`, integer columns with null values get converted to float
> instead of using pandas nullable types.
> If the column was created with pandas first it is correctly preserved (I
> guess it's using stored metadata for this).
> I've attached a screenshot showing this behavior.
> As currently there is support for nullable types in pandas, just as in Arrow,
> it would be great to use these types when dealing with columns with null
> values.
> If you are reluctant to change this behavior, a param would be nice too (e.g.
> `to_pandas(use_nullable_types: True)`).
>