Vadym Dytyniak created ARROW-18269:
--------------------------------------
Summary: Slash character in partition value handling
Key: ARROW-18269
URL: https://issues.apache.org/jira/browse/ARROW-18269
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 10.0.0
Reporter: Vadym Dytyniak
The example below shows that pyarrow does not correctly handle a partition
value that contains '/':
{code:java}
import pandas as pd
import pyarrow as pa
from pyarrow import dataset as ds
# One partition value ('A/Z') contains a slash.
df = pd.DataFrame({
    'value': [1, 2],
    'instrument_id': ['A/Z', 'B'],
})

# Write the table as a hive-partitioned parquet dataset.
ds.write_dataset(
    data=pa.Table.from_pandas(df),
    base_dir='data',
    format='parquet',
    partitioning=['instrument_id'],
    partitioning_flavor='hive',
)

# Read the dataset back and inspect the partition column.
table = ds.dataset(
    source='data',
    format='parquet',
    partitioning='hive',
).to_table()

tables = [table]
df = pa.concat_tables(tables).to_pandas()
print(df.head()){code}
{code:java}
   value instrument_id
0      1             A
1      2             B {code}
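A possible explanation, offered as an assumption and not verified against the Arrow source: if the partition value is written verbatim into the directory name, 'A/Z' turns into two path segments, and only the first one carries a key=value pair. The sketch below illustrates this with plain Python path handling; `hive_segments` and `parse_hive` are hypothetical helpers, not pyarrow functions.

```python
from pathlib import PurePosixPath


def hive_segments(base_dir, key, value):
    """Hypothetical helper: build a hive-style path naively and split it."""
    path = PurePosixPath(base_dir) / f"{key}={value}" / "part-0.parquet"
    return path.parts


def parse_hive(parts):
    """Hypothetical parser: keep only the segments shaped like key=value."""
    return dict(p.split("=", 1) for p in parts if "=" in p)


parts = hive_segments("data", "instrument_id", "A/Z")
print(parts)              # the slash splits the value across two segments
print(parse_hive(parts))  # only the text before the slash survives
```

Under this assumption the trailing 'Z' becomes a nested directory with no key, which would match the truncated 'A' seen above.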
Expected behaviour:
Option 1: Result should be:
{code:java}
   value instrument_id
0      1             A/Z
1      2             B {code}
Option 2: An error should be raised when a partition value contains '/'.
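Until one of these options is available, a caller-side workaround could be to percent-encode the partition column before writing and decode it after reading. This is only a sketch using the Python standard library; `encode_partition_value` and `decode_partition_value` are hypothetical helper names, and it assumes the caller applies them to the column itself and does not rely on pyarrow doing any decoding.

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    """Percent-encode every reserved character, including '/'."""
    return quote(value, safe="")


def decode_partition_value(value: str) -> str:
    """Reverse encode_partition_value."""
    return unquote(value)


values = ["A/Z", "B"]
encoded = [encode_partition_value(v) for v in values]
decoded = [decode_partition_value(v) for v in encoded]
print(encoded)
print(decoded)
```

With pandas, the helpers could be applied as `df['instrument_id'].map(encode_partition_value)` before `write_dataset` and `.map(decode_partition_value)` after `to_pandas()`. If the reader ever decodes percent-escapes on its own, the second step would have to be dropped.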
--
This message was sent by Atlassian Jira
(v8.20.10#820010)