Martin Gran created ARROW-14938:
-----------------------------------

             Summary: Partition column disappears when reading dataset
                 Key: ARROW-14938
                 URL: https://issues.apache.org/jira/browse/ARROW-14938
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 6.0.1
         Environment: Debian bullseye, python 3.9
            Reporter: Martin Gran


Appending CSV chunks to a Parquet dataset partitioned on "code":
{code:python}
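import pyarrow as pa
import pyarrow.dataset as ds

# chunk, y and output_path are defined in the surrounding code (not shown).
# {{i}} renders as the literal "{i}" placeholder that write_dataset fills in.
table = pa.Table.from_pandas(chunk)
ds.write_dataset(
    table,
    output_path,
    basename_template=f"chunk_{y}_{{i}}",
    format="parquet",
    partitioning=["code"],
    existing_data_behavior="overwrite_or_ignore",
)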
{code}
Loading the dataset again, I expect the "code" column to be in the dataframe:
{code:python}
import pyarrow.dataset as ds

dataset = ds.dataset("../data/interim/2020_elements_parquet/", format="parquet")
df = dataset.to_table().to_pandas()

df["code"]  # raises KeyError: 'code'
{code}
Traceback:
{code:python}
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3360             try:
-> 3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:

~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

~/.local/lib/python3.9/site-packages/pandas/_libs/index.pyx in pandas._libs.index.IndexEngine.get_loc()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

pandas/_libs/hashtable_class_helper.pxi in pandas._libs.hashtable.PyObjectHashTable.get_item()

KeyError: 'code'

The above exception was the direct cause of the following exception:

KeyError                                  Traceback (most recent call last)
/tmp/ipykernel_24875/4149106129.py in <module>
----> 1 df["code"]

~/.local/lib/python3.9/site-packages/pandas/core/frame.py in __getitem__(self, key)
   3456             if self.columns.nlevels > 1:
   3457                 return self._getitem_multilevel(key)
-> 3458             indexer = self.columns.get_loc(key)
   3459             if is_integer(indexer):
   3460                 indexer = [indexer]

~/.local/lib/python3.9/site-packages/pandas/core/indexes/base.py in get_loc(self, key, method, tolerance)
   3361                 return self._engine.get_loc(casted_key)
   3362             except KeyError as err:
-> 3363                 raise KeyError(key) from err
   3364
   3365         if is_scalar(key) and isna(key) and not self.hasnans:

KeyError: 'code'
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
