[ 
https://issues.apache.org/jira/browse/ARROW-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17634143#comment-17634143
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-18269:
--------------------------------------------------

Nice idea, although when we are reading, how do we know whether a field is 
encoded or not? Suppose the original key was `A%2FZ`: even if we detect 
encoded-looking data, it may be the original data in the dataset, so how do we 
know whether to decode it? Or do we just encode and decode unconditionally? Is 
that wise, considering the performance cost?
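
To make the question concrete, here is a minimal sketch of the "always encode, always decode" option. It assumes percent-encoding via Python's standard `urllib.parse`; the helper names `encode_partition_value` and `decode_partition_value` are hypothetical and are not Arrow APIs. It suggests that if every value is encoded on write (so a literal `%` becomes `%25`), decoding on read is unambiguous, while decoding a value that was never encoded is not.

```python
from urllib.parse import quote, unquote


def encode_partition_value(value: str) -> str:
    # Hypothetical helper: safe="" makes '/' get escaped as well;
    # '%' is always escaped by quote(), so literal '%2F' survives a round trip.
    return quote(value, safe="")


def decode_partition_value(segment: str) -> str:
    # Hypothetical helper: reverses encode_partition_value.
    return unquote(segment)


# Round trip when the writer always encodes and the reader always decodes.
for original in ["A/Z", "A%2FZ", "B"]:
    segment = encode_partition_value(original)
    assert "/" not in segment
    assert decode_partition_value(segment) == original
    print(f"{original!r} -> {segment!r}")

# The ambiguous case: a directory named 'A%2FZ' that was written WITHOUT
# encoding decodes to a different value than the one originally stored.
legacy_segment = "A%2FZ"
print(decode_partition_value(legacy_segment))
```

Under these assumptions the ambiguity only arises for datasets written without encoding, so the reader would still need some way to know which convention the writer used. The performance cost of encoding every value is not measured here.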

> [C++] Slash character in partition value handling
> -------------------------------------------------
>
>                 Key: ARROW-18269
>                 URL: https://issues.apache.org/jira/browse/ARROW-18269
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 10.0.0
>            Reporter: Vadym Dytyniak
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: good-first-issue
>
>  
> The example below shows that pyarrow does not correctly handle a partition 
> value that contains '/':
> {code:java}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import dataset as ds
> df = pd.DataFrame({
>     'value': [1, 2],
>     'instrument_id': ['A/Z', 'B'],
> })
> ds.write_dataset(
>     data=pa.Table.from_pandas(df),
>     base_dir='data',
>     format='parquet',
>     partitioning=['instrument_id'],
>     partitioning_flavor='hive',
> )
> table = ds.dataset(
>     source='data',
>     format='parquet',
>     partitioning='hive',
> ).to_table()
> tables = [table]
> df = pa.concat_tables(tables).to_pandas()
> print(df.head()){code}
> Result:
> {code:java}
>    value instrument_id
> 0      1             A
> 1      2             B {code}
> Expected behaviour:
> Option 1: Result should be:
> {code:java}
>    value instrument_id
> 0      1             A/Z
> 1      2             B {code}
> Option 2: Error should be raised to avoid '/' in partition value.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)