eirki commented on issue #33473: URL: https://github.com/apache/arrow/issues/33473#issuecomment-1932101944
I've been digging into this issue, and I think I've more or less figured it out.

First, the reason this only shows up with Pandas >= 1.5.0: it is caused by the value returned by this function: https://github.com/apache/arrow/blob/c9f6e04323a0b714487a0f707b46fc3c55b909e0/python/pyarrow/pandas_compat.py#L509 which is handled here: https://github.com/apache/arrow/blob/c9f6e04323a0b714487a0f707b46fc3c55b909e0/python/pyarrow/pandas_compat.py#L385

For some reason, older versions of Pandas will in some cases not return a RangeIndex from `get_level_values`, even though the index level in question actually is a RangeIndex:

```python
# Pandas 1.4.4
In [1]: import pandas as pd
   ...:
   ...: df = pd.DataFrame(
   ...:     data=[1, 2, 3],
   ...:     index=pd.MultiIndex.from_arrays([pd.RangeIndex(0, 3), pd.Index([1, 2, 3])]),
   ...: )
   ...: df.index.get_level_values(0)
Out[1]: Int64Index([0, 1, 2], dtype='int64')
```

Thus, the special handling of RangeIndex did not kick in, and the index was treated like a regular serialized index. This behaviour changed in Pandas 1.5.0, and now the special handling of RangeIndex does kick in.

This triggers a bug here: https://github.com/apache/arrow/blob/c9f6e04323a0b714487a0f707b46fc3c55b909e0/python/pyarrow/pandas_compat.py#L221 where the iteration `zip`s together `index_levels`, `index_descriptors`, and `index_types`, but `index_types` has already been filtered down to only the serialized (non-RangeIndex) indices, so it is shorter than the other two. Since `zip` stops at its shortest input, the other index level never gets visited in the iteration and is not included in the `pandas_metadata` object.

I have an idea for a fix, and will try to throw together a pull request.
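
The `zip` truncation described above can be reproduced in isolation. The following is only a minimal sketch: the three lists are hypothetical stand-ins for `index_levels`, `index_descriptors`, and `index_types`, and both helper functions are invented for illustration. It is not pyarrow's code, and the aligned variant is just one possible way to keep the sequences in step, not necessarily what the eventual pull request does.

```python
# Hypothetical stand-ins for the three sequences; not pyarrow's real objects.
index_levels = ["range_level", "int_level"]           # one entry per index level
index_descriptors = [{"kind": "range"}, "int_level"]  # a dict marks a non-serialized level
index_types = ["int64"]                               # only serialized levels have a type


def collect_metadata(levels, descriptors, types):
    """Zip all three sequences, skipping levels whose descriptor is a dict."""
    collected = []
    for level, descriptor, arrow_type in zip(levels, descriptors, types):
        if isinstance(descriptor, dict):
            # Non-serialized level (e.g. a RangeIndex): nothing to record.
            continue
        collected.append((level, arrow_type))
    return collected


def collect_metadata_aligned(levels, descriptors, types):
    """Consume a type only when the level is actually serialized."""
    remaining_types = iter(types)
    collected = []
    for level, descriptor in zip(levels, descriptors):
        if isinstance(descriptor, dict):
            continue
        collected.append((level, next(remaining_types)))
    return collected


# zip stops after the first tuple because index_types has a single entry,
# and that first tuple is the skipped range level.
buggy = collect_metadata(index_levels, index_descriptors, index_types)

aligned = collect_metadata_aligned(index_levels, index_descriptors, index_types)

print(buggy)
print(aligned)
```

In the first helper the serialized level is silently dropped, which mirrors the missing entry in `pandas_metadata`; in the second it is paired with its type.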
