[ https://issues.apache.org/jira/browse/ARROW-6999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16960931#comment-16960931 ]
Joris Van den Bossche commented on ARROW-6999:
----------------------------------------------
[~goodiegoodman] thanks for the report!
Your "steps to reproduce" actually do work if you do not use an empty dataframe:
{code}
In [15]: import pandas as pd
...: import pyarrow as pa
...: df = pd.DataFrame({'a': [1, 2, 3]})
...: schema = pa.Table.from_pandas(df).schema
...: pa_table = pa.Table.from_pandas(df, schema=schema)
In [16]: schema
Out[16]:
a: int64
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 3, "step": 1}], "column_indexes": [{"name": null, "field_'
b'name": null, "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "a", "f'
b'ield_name": "a", "pandas_type": "int64", "numpy_type": "int64", '
b'"metadata": null}], "creator": {"library": "pyarrow", "version":'
b' "0.15.1.dev177+g5df424bd6"}, "pandas_version": "0.26.0.dev0+669'
b'.g3c29114b1"}'}
{code}
The empty dataframe is a tricky edge case regarding the index, because in such a
case the index is not a RangeIndex but an empty object-dtype Index (see
ARROW-5104 for a similar report about that aspect).
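To make the index difference concrete, here is a minimal sketch using only pandas. The empty object-dtype Index is constructed by hand, which is an assumption for illustration: what a bare pd.DataFrame() returns by default may depend on the pandas version. The helper name has_range_index is hypothetical.

```python
import pandas as pd

# Non-empty frame: pandas assigns a default RangeIndex.
df = pd.DataFrame({"a": [1, 2, 3]})

# Empty frame with an explicitly empty object-dtype Index. This is built by
# hand (an assumption for illustration) so the result does not depend on what
# a bare pd.DataFrame() returns in a given pandas version.
empty_df = pd.DataFrame(index=pd.Index([], dtype=object))


def has_range_index(frame):
    """Return True if the frame's index is a pandas RangeIndex."""
    return isinstance(frame.index, pd.RangeIndex)


print(has_range_index(df))
print(has_range_index(empty_df))
print(empty_df.index.dtype)
```

A RangeIndex can be described purely in the metadata (as in the output above), whereas an object-dtype Index has to be stored as an actual column, which is where the "\_\_index_level_0\_\_" field comes from.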
That said, when an explicit schema is passed and it contains a field that is
not found in the dataframe but matches the "\_\_index_level_i\_\_" pattern, we
should try to handle this (certainly when passing {{preserve_index=True}}).
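A rough sketch of what such handling could look like, in plain Python without pyarrow. This is not the actual pyarrow implementation: the function resolve_schema_names and its arguments are hypothetical, and the only thing taken from the report is the "__index_level_i__" naming pattern.

```python
import re

# Naming pattern for serialized unnamed index levels, e.g. "__index_level_0__".
INDEX_LEVEL_PATTERN = re.compile(r"^__index_level_(\d+)__$")


def resolve_schema_names(schema_names, column_names, index_names):
    """Resolve each schema field name to a column or an index level.

    Hypothetical helper: returns a list of (name, kind, level) tuples, where
    kind is "column" or "index" and level is the index level number (None for
    columns). An unnamed index level is represented by None in index_names.
    """
    resolved = []
    for name in schema_names:
        if name in column_names:
            resolved.append((name, "column", None))
            continue
        if name in index_names:
            # Named index level: match it by name.
            resolved.append((name, "index", index_names.index(name)))
            continue
        match = INDEX_LEVEL_PATTERN.match(name)
        if match is not None:
            level = int(match.group(1))
            # Fall back to the positional index level if it exists and is unnamed.
            if level < len(index_names) and index_names[level] is None:
                resolved.append((name, "index", level))
                continue
        raise KeyError(
            "name '{}' present in the specified schema is not "
            "found in the columns or index".format(name)
        )
    return resolved


# Schema with one data column and one unnamed index level.
print(resolve_schema_names(["a", "__index_level_0__"], ["a"], [None]))
```

The point of the fallback branch is that a name which looks like a serialized index level is matched against the dataframe's index by position rather than being looked up as a column, which is the lookup that raises the KeyError in the report.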
> [Python] KeyError: '__index_level_0__' passing Table.from_pandas its own
> schema
> -------------------------------------------------------------------------------
>
> Key: ARROW-6999
> URL: https://issues.apache.org/jira/browse/ARROW-6999
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.12.0, 0.13.0, 0.14.0, 0.15.0
> Environment: pandas==0.23.4
> pyarrow==0.15.0 # Issue also with 0.14.0, 0.13.0 & 0.12.0. but not 0.11.0
> Reporter: Tom Goodman
> Priority: Major
> Fix For: 1.0.0
>
>
> Steps to reproduce:
> # Generate any DataFrame's pyarrow Schema using Table.from_pandas
> # Pass the generated schema as input into Table.from_pandas
> # Causes KeyError: '__index_level_0__'
> We did not have this issue with pyarrow==0.11.0 which we used to write many
> partitions across years. Our goal now is to use pyarrow==0.15.0 and produce
> schema going forward that are *backwards compatible* (i.e. also have
> '__index_level_0__'), so we should not need to re-generate all prior years'
> partitions when we migrate to 0.15.0.
> We cannot set _preserve_index=False_, since that effectively deletes
> '__index_level_0__', causing inconsistent schema across earlier partitions
> that had been written using pyarrow==0.11.0.
>
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame()
> schema = pa.Table.from_pandas(df).schema
> pa_table = pa.Table.from_pandas(df, schema=schema)
> {code}
> {noformat}
> Traceback (most recent call last):
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
> line 3078, in get_loc
> return self._engine.get_loc(key)
> File "pandas/_libs/index.pyx", line 140, in
> pandas._libs.index.IndexEngine.get_loc
> File "pandas/_libs/index.pyx", line 162, in
> pandas._libs.index.IndexEngine.get_loc
> File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in
> pandas._libs.hashtable.PyObjectHashTable.get_item
> File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
> line 408, in _get_columns_to_convert_given_schema
> col = df[name]
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
> line 2688, in __getitem__
> return self._getitem_column(key)
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/frame.py",
> line 2695, in _getitem_column
> return self._get_item_cache(key)
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/generic.py",
> line 2489, in _get_item_cache
> values = self._data.get(item)
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/internals.py",
> line 4115, in get
> loc = self.items.get_loc(item)
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pandas/core/indexes/base.py",
> line 3080, in get_loc
> return self._engine.get_loc(self._maybe_cast_indexer(key))
> File "pandas/_libs/index.pyx", line 140, in
> pandas._libs.index.IndexEngine.get_loc
> File "pandas/_libs/index.pyx", line 162, in
> pandas._libs.index.IndexEngine.get_loc
> File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in
> pandas._libs.hashtable.PyObjectHashTable.get_item
> File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in
> pandas._libs.hashtable.PyObjectHashTable.get_item
> KeyError: '__index_level_0__'
> During handling of the above exception, another exception occurred:
> Traceback (most recent call last):
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/IPython/core/interactiveshell.py",
> line 3326, in run_code
> exec(code_obj, self.user_global_ns, self.user_ns)
> File "<ipython-input-36-6711a2fcec96>", line 5, in <module>
> pa_table = pa.Table.from_pandas(df,
> schema=pa.Table.from_pandas(df).schema)
> File "pyarrow/table.pxi", line 1057, in pyarrow.lib.Table.from_pandas
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
> line 517, in dataframe_to_arrays
> columns)
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
> line 337, in _get_columns_to_convert
> return _get_columns_to_convert_given_schema(df, schema, preserve_index)
> File
> "/GAAR/FIAG/sandbox/software/miniconda3/envs/rc_sfi_2019.1/lib/python3.6/site-packages/pyarrow/pandas_compat.py",
> line 426, in _get_columns_to_convert_given_schema
> "in the columns or index".format(name))
> KeyError: "name '__index_level_0__' present in the specified schema is not
> found in the columns or index"
> {noformat}