Florian Jetter created ARROW-5138:
-------------------------------------

             Summary: [Python/C++] Row group retrieval doesn't restore index 
properly
                 Key: ARROW-5138
                 URL: https://issues.apache.org/jira/browse/ARROW-5138
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Florian Jetter


When retrieving a single row group, the index is no longer restored to its 
original values; it is always reset to a RangeIndex starting at zero. 
Version 0.12.1 restored an Int64Index with the correct index values.


{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
print(pa.__version__)
df = pd.DataFrame(
    {"a": [1, 2, 3, 4]}
)
print("total DF")
print(df.index)
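table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
# Write two rows per row group, so the file contains two row groups.
pq.write_table(table, buf, chunk_size=2)
reader = pa.BufferReader(buf.getvalue().to_pybytes())
parquet_file = pq.ParquetFile(reader)
# Read only the second row group (rows 2 and 3 of the original frame).
rg = parquet_file.read_row_group(1)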

df_restored = rg.to_pandas()
print("Row group")
print(df_restored.index)
{code}

Previous behavior
{code:python}
0.12.1
total DF
RangeIndex(start=0, stop=4, step=1)
Row group
Int64Index([2, 3], dtype='int64')
{code}

Behavior now
{code:python}
0.13.0
total DF
RangeIndex(start=0, stop=4, step=1)
Row group
RangeIndex(start=0, stop=2, step=1)
{code}
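
For reference, the index values expected for a given row group can be worked out by hand. The following is a minimal sketch, not pyarrow code: expected_row_group_index is a hypothetical helper. It assumes the original frame has the default RangeIndex starting at zero and that row groups are filled sequentially with chunk_size rows each, as in the example above.

```python
def expected_row_group_index(n_rows, chunk_size, row_group):
    """Index values a row group should carry (hypothetical helper).

    Assumes a default RangeIndex starting at 0 and row groups that are
    filled sequentially with `chunk_size` rows each.
    """
    start = row_group * chunk_size
    if row_group < 0 or start >= n_rows:
        raise IndexError("row group out of range")
    # The last row group may hold fewer than chunk_size rows.
    stop = min(start + chunk_size, n_rows)
    return list(range(start, stop))


# Values from the report: 4 rows, chunk_size=2, second row group (number 1).
expected = expected_row_group_index(n_rows=4, chunk_size=2, row_group=1)
print(expected)  # [2, 3], the index values 0.12.1 restored
```

Comparing this list against list(df_restored.index) gives a quick check for the regression.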



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
