[
https://issues.apache.org/jira/browse/ARROW-5427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rok Mihevc updated ARROW-5427:
------------------------------
External issue URL: https://github.com/apache/arrow/issues/21880
> [Python] RangeIndex serialization change implications
> -----------------------------------------------------
>
> Key: ARROW-5427
> URL: https://issues.apache.org/jira/browse/ARROW-5427
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Affects Versions: 0.13.0
> Reporter: Joris Van den Bossche
> Assignee: Joris Van den Bossche
> Priority: Blocker
> Labels: pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 1h 10m
> Remaining Estimate: 0h
>
> In 0.13, the conversion of a pandas DataFrame's RangeIndex changed: it is no
> longer serialized as an actual column in the arrow table, but only saved as
> metadata (in the pandas metadata) (ARROW-1639).
> This change led to a couple of issues:
> - It can sometimes be unpredictable in pandas whether you have a RangeIndex
> or not, which means that the resulting schema in arrow can be somewhat
> unexpected. See ARROW-5104: an empty DataFrame has a RangeIndex or not,
> depending on how it was created.
> - The metadata is not always enough (or not updated) to reconstruct it when
> the table has been modified / subsetted.
> For example, ARROW-5138: retrieving a single row group from parquet file
> doesn't restore index properly (since the RangeIndex metadata was for the
> full table, not this subset)
> And another one, ARROW-5139: empty column selection no longer restores
> index.
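The second problem above (metadata that no longer matches a modified table) can be sketched with a toy model. This is only an illustration under stated assumptions: plain dicts and lists stand in for an Arrow table and its pandas metadata, and the helper names `to_table`, `take_rows`, and `restore_index` are hypothetical, not pyarrow functions.

```python
# Toy model only: plain dicts and lists stand in for an Arrow table and its
# pandas metadata. None of these names are pyarrow APIs.

def to_table(values, start=0, step=1):
    """Store the data plus a RangeIndex description kept as metadata only."""
    stop = start + step * len(values)
    return {
        "column": list(values),
        "index_meta": {"kind": "range", "start": start, "stop": stop, "step": step},
    }


def take_rows(table, lo, hi):
    """Subset the rows but leave the index metadata untouched."""
    return {"column": table["column"][lo:hi], "index_meta": table["index_meta"]}


def restore_index(table):
    """Rebuild the index from metadata and report whether it fits the data."""
    meta = table["index_meta"]
    index = list(range(meta["start"], meta["stop"], meta["step"]))
    return index, len(index) == len(table["column"])


full = to_table([10, 20, 30, 40, 50, 60])
subset = take_rows(full, 2, 4)  # e.g. one "row group" holding two rows

full_index, full_ok = restore_index(full)
subset_index, subset_ok = restore_index(subset)

print(full_ok)    # metadata describes the full table
print(subset_ok)  # metadata still describes six rows, but only two remain
```

In this model the stored range is only valid for the table it was written for, so any row subset needs the metadata rewritten or the index stored as a real column.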
> I think we should decide whether we want to try to fix those (or give an
> option to avoid those issues), or close them as "won't fix".
> One idea I had that could potentially alleviate some of those issues:
> - Make it possible for the user to still force actual serialization of the
> index, always, even if it is a RangeIndex.
> - To avoid introducing a new option, we could reuse the {{preserve_index}}
> keyword: change the default to None (which means the current behaviour), and
> change {{True}} to mean "always serialize" (although this is not fully
> backwards compatible with 0.13.0 for those users who explicitly specified the
> keyword).
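The proposed three-valued keyword can be written down as a small decision rule. The sketch below is a hypothetical helper, not pyarrow code: `index_storage` and its return strings are made up for illustration, and the meaning of `False` (index not stored at all) is an assumption, since the description above only covers `None` and `True`.

```python
# Hypothetical decision helper, not pyarrow code: it only encodes the rule
# proposed above for the three values of preserve_index.

def index_storage(is_range_index, preserve_index=None):
    """Return "column", "metadata", or "dropped" for a DataFrame index."""
    if preserve_index is False:
        # Assumption: False means the index is not stored at all.
        return "dropped"
    if preserve_index is True:
        # Proposed: always serialize the index as an actual column.
        return "column"
    if preserve_index is None:
        # Proposed default, equal to the current 0.13 behaviour:
        # a RangeIndex is kept as metadata only, other indexes as columns.
        return "metadata" if is_range_index else "column"
    raise ValueError("preserve_index must be True, False, or None")


print(index_storage(is_range_index=True))                       # default
print(index_storage(is_range_index=True, preserve_index=True))  # forced
```

The backwards-compatibility concern shows up in the second call: a user who passed `True` explicitly under 0.13 would see a RangeIndex move from metadata to a column.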
> I am not sure this is worth the added complexity (although I personally like
> providing the option where the index is simply always serialized as columns,
> without surprises). But ideally we decide on it for 0.14, to either fix or
> close the mentioned issues.