Martin Gran created ARROW-14938:
-----------------------------------
Summary: Partition column disappears when reading dataset
Key: ARROW-14938
URL: https://issues.apache.org/jira/browse/ARROW-14938
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 6.0.1
Environment: Debian bullseye, python 3.9
Reporter: Martin Gran
Appending CSV chunks to a Parquet dataset, partitioned on "code":
{code:python}
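import pyarrow as pa
import pyarrow.dataset as ds

# chunk is a pandas DataFrame; y and output_path are set earlier (not shown)
table = pa.Table.from_pandas(chunk)
ds.write_dataset(
    table,
    output_path,
    basename_template=f"chunk_{y}_{{i}}",
    format="parquet",
    partitioning=["code"],
    existing_data_behavior="overwrite_or_ignore",
)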
{code}
Loading the dataset again, I expect the "code" column to be in the dataframe:
{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset(
    "../data/interim/2020_elements_parquet/",
    format="parquet",
)
df = dataset.to_table().to_pandas()
df["code"]  # raises KeyError: 'code'
{code}
Trace
{code:python}
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'code'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_24875/4149106129.py in <module>
----> 1 df["code"]

~/.local/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'code'
{code}