[
https://issues.apache.org/jira/browse/ARROW-14938?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17452138#comment-17452138
]
Weston Pace commented on ARROW-14938:
-------------------------------------
I added some info on the GitHub issue too. My guess is that "hive" didn't
work because you were specifying it only on the read and not on the write.
> Partition column disappears when reading dataset
> ------------------------------------------------
>
> Key: ARROW-14938
> URL: https://issues.apache.org/jira/browse/ARROW-14938
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 6.0.1
> Environment: Debian bullseye, python 3.9
> Reporter: Martin Gran
> Priority: Major
>
> Appending CSV to parquet dataset with partitioning on "code".
> {code:python}
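> import pyarrow as pa
> import pyarrow.dataset  # makes pa.dataset available
>
> table = pa.Table.from_pandas(chunk)
> pa.dataset.write_dataset(
>     table,
>     output_path,
>     # {{i}} leaves a literal "{i}" placeholder for pyarrow to fill in
>     basename_template=f"chunk_{y}_{{i}}",
>     format="parquet",
>     partitioning=["code"],
>     existing_data_behavior="overwrite_or_ignore",
> )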
> {code}
> Loading the dataset again and expecting code to be in the dataframe.
> {code:python}
> import pyarrow.dataset as ds
>
> dataset = ds.dataset(
>     "../data/interim/2020_elements_parquet/",
>     format="parquet",
> )
> df = dataset.to_table().to_pandas()
> df["code"]  # raises KeyError: 'code'
> {code}
> Trace
> {code:python}
> ---------------------------------------------------------------------------
> KeyError                                  Traceback (most recent call last)
> ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
>    3360             try:
> -> 3361                 return self._engine.get_loc(casted_key)
>    3362             except KeyError as err:
>
> ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
>
> ~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()
>
> pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
>
> pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()
>
> KeyError: 'code'
>
> The above exception was the direct cause of the following exception:
>
> KeyError                                  Traceback (most recent call last)
> /tmp/ipykernel_24875/4149106129.py in <module>
> ----> 1 df["code"]
>
> ~/.local/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
>    3456             if self.columns.nlevels > 1:
>    3457                 return self._getitem_multilevel(key)
> -> 3458             indexer = self.columns.get_loc(key)
>    3459             if is_integer(indexer):
>    3460                 indexer = [indexer]
>
> ~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
>    3361                 return self._engine.get_loc(casted_key)
>    3362             except KeyError as err:
> -> 3363                 raise KeyError(key) from err
>    3364
>    3365         if is_scalar(key) and isna(key) and not self.hasnans:
>
> KeyError: 'code'
> {code}