Joris Van den Bossche created ARROW-5427:
--------------------------------------------

             Summary: [Python] RangeIndex serialization change implications
                 Key: ARROW-5427
                 URL: https://issues.apache.org/jira/browse/ARROW-5427
             Project: Apache Arrow
          Issue Type: Improvement
          Components: Python
    Affects Versions: 0.13.0
            Reporter: Joris Van den Bossche
             Fix For: 0.14.0


In 0.13, the conversion of a pandas DataFrame's RangeIndex changed: it is no 
longer serialized as an actual column in the arrow table, but only recorded in 
the pandas metadata (ARROW-1639).

This change led to a couple of issues:

- In pandas it is not always predictable whether a DataFrame has a RangeIndex 
or not, which means that the resulting schema in arrow can be somewhat 
unexpected. See ARROW-5104: an empty DataFrame has a RangeIndex or not 
depending on how it was created.
- The metadata is not always sufficient (or is not updated) to reconstruct the 
index when the table has been modified or subsetted.  
  For example, ARROW-5138: retrieving a single row group from a parquet file 
doesn't restore the index properly (since the RangeIndex metadata described the 
full table, not this subset).
  Another example is ARROW-5139: an empty column selection no longer restores 
the index.
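
The row-group case can be sketched in plain Python. This is only a model of the problem, not pyarrow code: the descriptor dict is a hypothetical stand-in for the RangeIndex entry in the pandas metadata, and {{index_from_metadata}} is a made-up helper.

```python
# Hypothetical descriptor, written when the full 6-row table was saved.
full_table_index = {"kind": "range", "start": 0, "stop": 6, "step": 1}


def index_from_metadata(descriptor, num_rows):
    """Rebuild the index from the descriptor, or return None if it does not fit."""
    candidate = range(descriptor["start"], descriptor["stop"], descriptor["step"])
    if len(candidate) != num_rows:
        # The descriptor describes a different number of rows than we have.
        return None
    return list(candidate)


# Reading the whole table back: the descriptor matches the data.
full = index_from_metadata(full_table_index, num_rows=6)

# Reading a single 3-row row group: the same descriptor no longer fits,
# and it does not say which 3 of the 6 index values these rows had.
subset = index_from_metadata(full_table_index, num_rows=3)
```

With the index stored as an actual column, the subset would carry its own index values and no reconstruction would be needed.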

I think we should decide whether we want to try to fix those (or give an 
option to avoid those issues), or close them as "won't fix".

One idea I had that could potentially alleviate some of those issues:

- Make it possible for the user to still force actual serialization of the 
index, always, even if it is a RangeIndex.
- To avoid introducing a new option, we could reuse the {{preserve_index}} 
keyword: change the default to {{None}} (which keeps the current behaviour), 
and change {{True}} to mean "always serialize" (although this is not fully 
backwards compatible with 0.13.0 for those users who explicitly specified the 
keyword).
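
The proposed meaning of the keyword can be written down as a small decision function. This is only a sketch of the idea, not pyarrow code: the name {{index_is_serialized}} and its {{is_range_index}} argument are hypothetical, and {{False}} is assumed to keep its existing meaning of dropping the index.

```python
def index_is_serialized(is_range_index, preserve_index=None):
    """Model of the proposal: does the index end up as a real column?"""
    if preserve_index is None:
        # Proposed default: keep the current behaviour, where a RangeIndex
        # goes to metadata only and any other index becomes a column.
        return not is_range_index
    # Explicit True: always serialize, even a RangeIndex.
    # Explicit False: never serialize (assumption, see the note above).
    return bool(preserve_index)


# Enumerate every combination of index kind and keyword value.
decisions = {
    (is_range, keyword): index_is_serialized(is_range, keyword)
    for is_range in (True, False)
    for keyword in (None, True, False)
}
```

The only combination that differs from 0.13.0 is a RangeIndex with an explicit {{True}}, which is the backwards-compatibility caveat mentioned above.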

I am not sure this is worth the added complexity (although I personally like 
providing the option where the index is simply always serialized as columns, 
without surprises). But ideally we decide on it for 0.14, to either fix or 
close the mentioned issues.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
