Christian Thiel created ARROW-3861:
--------------------------------------
Summary: ParquetDataset().read columns argument always returns
partition column
Key: ARROW-3861
URL: https://issues.apache.org/jira/browse/ARROW-3861
Project: Apache Arrow
Issue Type: Bug
Reporter: Christian Thiel
I just noticed that no matter which columns are specified when loading a dataset,
the partition column is always returned. This can lead to unexpected behaviour,
because the resulting DataFrame has more columns than were requested:
{code}
import dask as da
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
import os
import numpy as np
import shutil
PATH_PYARROW_MANUAL = '/tmp/pyarrow_manual.pa/'
if os.path.exists(PATH_PYARROW_MANUAL):
    shutil.rmtree(PATH_PYARROW_MANUAL)
os.mkdir(PATH_PYARROW_MANUAL)
# dtype=object is needed for a ragged array on recent NumPy versions
arrays = np.array([np.array([0, 1, 2]), np.array([3, 4]), np.nan, np.nan],
                  dtype=object)
strings = np.array([np.nan, np.nan, 'a', 'b'])
df = pd.DataFrame([0, 0, 1, 1], columns=['partition_column'])
df.index.name='DPRD_ID'
df['arrays'] = pd.Series(arrays)
df['strings'] = pd.Series(strings)
my_schema = pa.schema([('DPRD_ID', pa.int64()),
                       ('partition_column', pa.int32()),
                       ('arrays', pa.list_(pa.int32())),
                       ('strings', pa.string()),
                       ('new_column', pa.string())])
table = pa.Table.from_pandas(df, schema=my_schema)
pq.write_to_dataset(table, root_path=PATH_PYARROW_MANUAL,
                    partition_cols=['partition_column'])
df_pq = pq.ParquetDataset(PATH_PYARROW_MANUAL).read(
    columns=['DPRD_ID', 'strings']).to_pandas()
# pd.read_parquet(PATH_PYARROW_MANUAL, columns=['DPRD_ID', 'strings'],
#                 engine='pyarrow')
df_pq
{code}
df_pq contains the column `partition_column`, although only `DPRD_ID` and
`strings` were requested.
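
Until the behaviour changes, one possible workaround is to drop the unrequested columns after reading. The sketch below is not part of the original report; the helper name `extra_columns` is hypothetical, and it works only on plain column names, so it makes no assumptions about the pyarrow API.

```python
def extra_columns(returned, requested):
    """Return the names in `returned` that were not in `requested`, in order."""
    requested = set(requested)
    return [name for name in returned if name not in requested]


# Column names as described in this report: only 'DPRD_ID' and 'strings'
# were requested, but 'partition_column' came back as well.
returned = ['DPRD_ID', 'strings', 'partition_column']
extras = extra_columns(returned, ['DPRD_ID', 'strings'])
print(extras)  # ['partition_column']
```

Applied to the DataFrame above, this would presumably look like `df_pq.drop(columns=extra_columns(df_pq.columns, ['DPRD_ID', 'strings']))`.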