Florian Jetter created ARROW-5138:
-------------------------------------

             Summary: [Python/C++] Row group retrieval doesn't restore index properly
                 Key: ARROW-5138
                 URL: https://issues.apache.org/jira/browse/ARROW-5138
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Florian Jetter

When retrieving row groups, the index is no longer restored to its original values; it is always set to a range index starting at zero. Version 0.12.1 restored an int64 index with the correct index values.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)

df = pd.DataFrame(
    {"a": [1, 2, 3, 4]}
)
print("total DF")
print(df.index)

table = pa.Table.from_pandas(df)
buf = pa.BufferOutputStream()
pq.write_table(table, buf, chunk_size=2)
reader = pa.BufferReader(buf.getvalue().to_pybytes())
parquet_file = pq.ParquetFile(reader)
rg = parquet_file.read_row_group(1)
df_restored = rg.to_pandas()
print("Row group")
print(df_restored.index)
{code}

Previous behavior

{code:python}
0.12.1
total DF
RangeIndex(start=0, stop=4, step=1)
Row group
Int64Index([2, 3], dtype='int64')
{code}

Behavior now

{code:python}
0.13.0
total DF
RangeIndex(start=0, stop=4, step=1)
Row group
RangeIndex(start=0, stop=2, step=1)
{code}
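
For reference, the index values that 0.12.1 returned for a row group follow directly from the row counts of the groups before it. The following is only a minimal sketch of that arithmetic, assuming the original frame had a default RangeIndex starting at zero and that the per-group row counts are known; `expected_index_values` is a hypothetical helper for illustration and is not part of pyarrow.

```python
from itertools import accumulate


def expected_index_values(row_counts, group):
    """Return the original row positions covered by row group `group`.

    Hypothetical helper: assumes the frame had a default RangeIndex
    starting at 0 and that `row_counts` lists the rows per row group.
    """
    # Running totals give the start offset of each group: [0, n0, n0+n1, ...]
    offsets = [0] + list(accumulate(row_counts))
    return list(range(offsets[group], offsets[group + 1]))


# Four rows written with chunk_size=2 produce two groups of two rows each.
print(expected_index_values([2, 2], 1))  # [2, 3]
```

In the reproduction above, the row counts could presumably be read from `parquet_file.metadata.row_group(i).num_rows`. This does not cover frames with a non-default index, where the values have to come from the stored index column.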