[ https://issues.apache.org/jira/browse/ARROW-5139?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833597#comment-16833597 ]
Joris Van den Bossche commented on ARROW-5139:
----------------------------------------------

[~fjetter] thanks for the report!

A somewhat simpler reproducible example, without parquet (but with the same underlying cause: the RangeIndex is indeed not reconstructed for empty tables):

{code}
In [1]: import pandas as pd
   ...: import pyarrow as pa

In [2]: pa.__version__
Out[2]: '0.12.0'

In [3]: df = pd.DataFrame(
   ...:     {"a": [1, 2]}
   ...: )

In [4]: table = pa.Table.from_pandas(df, columns=[], preserve_index=True)

In [5]: table
Out[5]:
pyarrow.Table
__index_level_0__: int64
metadata
--------
OrderedDict([(b'pandas',
              b'{"index_columns": ["__index_level_0__"], "column_indexes": ['
              b'{"name": null, "field_name": null, "pandas_type": "unicode",'
              b' "numpy_type": "object", "metadata": {"encoding": "UTF-8"}}]'
              b', "columns": [{"name": null, "field_name": "__index_level_0_'
              b'_", "pandas_type": "int64", "numpy_type": "int64", "metadata'
              b'": null}], "pandas_version": "0.23.4"}')])

In [6]: print(table.to_pandas())
Empty DataFrame
Columns: []
Index: [0, 1]

In [7]: table.to_pandas().index
Out[7]: Int64Index([0, 1], dtype='int64')
{code}

However, the same code now gives:

{code}
In [4]: table
Out[4]:
pyarrow.Table
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
            b'stop": 2, "step": 1}], "column_indexes": [{"name": null, "field_'
            b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
            b'metadata": {"encoding": "UTF-8"}}], "columns": [], "creator": {"'
            b'library": "pyarrow", "version": "0.13.1.dev126+ga9ae4a9f.d201905'
            b'03"}, "pandas_version": "0.24.2"}'}

In [5]: print(table.to_pandas())
Empty DataFrame
Columns: []
Index: []

In [6]: table.to_pandas().index
Out[6]: RangeIndex(start=0, stop=0, step=1)
{code}

> [Python/C++] Empty column selection no longer restores index
> ------------------------------------------------------------
>
> Key: ARROW-5139
> URL: https://issues.apache.org/jira/browse/ARROW-5139
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.12.1
> Reporter: Florian Jetter
> Priority: Minor
> Labels: parquet
>
> The index of a dataframe is no longer reconstructed when using empty column
> selection. This is a regression relative to 0.12.1 and probably only happens
> for pd.RangeIndex.
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> from kartothek.serialization import ParquetSerializer
> from storefact import get_store_from_url
> print(pa.__version__)
> df = pd.DataFrame(
>     {"a": [1, 2]}
> )
> print(df.index)
> table = pa.Table.from_pandas(df)
> buf = pa.BufferOutputStream()
> pq.write_table(table, buf)
> reader = pa.BufferReader(buf.getvalue().to_pybytes())
> table_restored = pq.read_pandas(reader, columns=[])
> df_restored = table_restored.to_pandas()
> print(len(df_restored))
> {code}
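
For illustration: the range descriptor in the second metadata dump of the comment (kind "range", start 0, stop 2, step 1) seems to carry enough information to rebuild the index even when no columns are selected. Below is a minimal sketch using only the standard library. `range_indexes_from_metadata` is a hypothetical helper, not part of pyarrow, and this is not a description of how pyarrow itself restores the index.

```python
import json

# Value of the b'pandas' metadata key, copied from the second output in the
# comment (the "creator" entry is omitted here for brevity).
PANDAS_METADATA = (
    b'{"index_columns": [{"kind": "range", "name": null, "start": 0, '
    b'"stop": 2, "step": 1}], "column_indexes": [{"name": null, '
    b'"field_name": null, "pandas_type": "unicode", "numpy_type": "object", '
    b'"metadata": {"encoding": "UTF-8"}}], "columns": [], '
    b'"pandas_version": "0.24.2"}'
)


def range_indexes_from_metadata(raw):
    """Hypothetical helper: one Python range per range-kind index descriptor."""
    meta = json.loads(raw.decode("utf-8"))
    ranges = []
    for descr in meta["index_columns"]:
        # In the older metadata (first output) the entry is a column name
        # string such as "__index_level_0__", so only dict entries qualify.
        if isinstance(descr, dict) and descr.get("kind") == "range":
            ranges.append(range(descr["start"], descr["stop"], descr["step"]))
    return ranges


index = range_indexes_from_metadata(PANDAS_METADATA)[0]
print(list(index), len(index))
```

The recovered range has length 2, matching the original frame, whereas the restored frame in the second output has an index of length 0.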