Christian Thiel created ARROW-3861:
--------------------------------------

Summary: ParquetDataset().read columns argument always returns partition column
Key: ARROW-3861
URL: https://issues.apache.org/jira/browse/ARROW-3861
Project: Apache Arrow
Issue Type: Bug
Reporter: Christian Thiel

I just noticed that no matter which columns are specified when loading a dataset, the partition column is always returned. This can lead to unexpected behaviour, because the resulting dataframe has more columns than were requested:

{code}
import dask as da
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil

PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'

if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)

arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan])
strings = np.array([np.nan, np.nan, 'a', 'b'])

df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name='DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)

my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])

table = pa.Table.from_pandas(df, schema=my_schema)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL, partition_cols=['partition_column'])

df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(columns=['DPRD_ID', 'strings']).to_pandas()
# pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'], engine='pyarrow')
df_pq
{code}

df_pq contains the column `partition_column`, although it was not requested.
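
Until this is fixed, a possible workaround is to drop the extra column after reading. The sketch below is only an illustration of that idea, not pyarrow API: the helper name `unrequested_partition_columns` is hypothetical, and the `returned` list is an assumed example of what the loaded dataframe reports, since the report only states that `partition_column` is present.

```python
def unrequested_partition_columns(returned, requested, partition_cols):
    """Return partition columns that came back without being requested.

    returned:       column names present in the loaded table or dataframe
    requested:      names passed as `columns=` (None means "all columns")
    partition_cols: names used as `partition_cols` when the dataset was written
    """
    if requested is None:
        # Everything was asked for, so no column is unexpected.
        return []
    requested = set(requested)
    partition_cols = set(partition_cols)
    return [c for c in returned if c in partition_cols and c not in requested]


# Names follow the reproduction above; `returned` is an assumed example.
extra = unrequested_partition_columns(
    returned=["DPRD_ID", "strings", "partition_column"],
    requested=["DPRD_ID", "strings"],
    partition_cols=["partition_column"],
)
print(extra)  # ['partition_column']
```

With pandas, the result could then be used as `df_pq = df_pq.drop(columns=extra)`. The helper works only on lists of names, so it does not depend on whether `DPRD_ID` ends up as a column or as the index.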