[jira] [Commented] (ARROW-18269) [C++] Slash character in partition value handling

Vibhatha Lakmal Abeykoon (Jira) Fri, 11 Nov 2022 00:19:38 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632152#comment-17632152
 ]


Vibhatha Lakmal Abeykoon commented on ARROW-18269:
--------------------------------------------------

[~westonpace] 

So the context here is that the partition column data is used to build the 
output directory path. When a value contains a '/', that character is 
implicitly treated as a separator when we form the directory path. Thus 
`A/Z` creates an `A` folder with a `Z` folder inside it. I am not sure whether 
we can remove that step or ask the code to ignore the separator. 
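
To make the failure mode concrete, here is a minimal plain-Python sketch. It 
is not the Arrow C++ code; `hive_segment` and `parse_hive_path` are 
hypothetical helpers that only mimic joining a hive-style `key=value` segment 
into a path and then naively splitting that path again on '/'.

```python
import posixpath


def hive_segment(key, value):
    # Hypothetical helper: build a hive-style "key=value" directory segment.
    return f"{key}={value}"


def parse_hive_path(path):
    # Naive reader: split on '/' and keep only the "key=value" segments.
    result = {}
    for segment in path.split("/"):
        key, sep, value = segment.partition("=")
        if sep:
            result[key] = value
    return result


# The '/' inside the value becomes an extra directory level.
path = posixpath.join("data", hive_segment("instrument_id", "A/Z"), "part-0.parquet")
print(path)
print(parse_hive_path(path))
```

Under these assumptions the `Z` ends up as its own directory level, so the 
naive reader only recovers `A` for `instrument_id`.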

But on the read side, when we recreate the fragments, we could decide 
whether to treat the nested directories as a path or as a single value. If we 
treat them as a path (which is what happens at the moment), we get the 
erroneous output, but if we treat them as a single value rather than a path, 
we could retrieve the value accurately. 

This is one viable option. If we do that, we can provide a lambda or a flag to 
control this behavior. 

I think a function to determine the key decoding from the file path would be 
better. 
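
A rough sketch of what such a function could look like, in plain Python 
rather than the Arrow C++ API. The name `parse_with_known_keys` and its 
signature are hypothetical. It assumes the reader already knows the partition 
field names and treats any segment that does not start with a known `key=` 
prefix as a continuation of the previous value.

```python
def parse_with_known_keys(rel_dir, keys):
    # rel_dir: directory part of a fragment path, relative to the dataset root.
    # keys: set of known partition field names (an assumption of this sketch).
    result = {}
    current = None
    for segment in rel_dir.split("/"):
        key, sep, value = segment.partition("=")
        if sep and key in keys:
            # A new "key=value" segment starts here.
            current = key
            result[key] = value
        elif current is not None:
            # Not a known key: glue it back onto the previous value.
            result[current] += "/" + segment
    return result


print(parse_with_known_keys("instrument_id=A/Z", {"instrument_id"}))
print(parse_with_known_keys("instrument_id=B", {"instrument_id"}))
```

This stays ambiguous if a value itself contains text that looks like 
`other_key=...` for another partition field, so it is only meant to 
illustrate the idea.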

Is this overly complicated or a non-generic solution?

That said, I am inclined towards option 1 rather than option 2. Option 2 is 
pretty straightforward to do, but a case like the one mentioned above could be 
very common.

How is the URL encoding/decoding part relevant here? Am I missing something?

Could you please clarify a bit? 

> [C++] Slash character in partition value handling
> -------------------------------------------------
>
>                 Key: ARROW-18269
>                 URL: https://issues.apache.org/jira/browse/ARROW-18269
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 10.0.0
>            Reporter: Vadym Dytyniak
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: good-first-issue
>
>  
> The provided example shows that pyarrow does not correctly handle a 
> partition value that contains '/':
> {code:java}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import dataset as ds
> df = pd.DataFrame({
>     'value': [1, 2],
>     'instrument_id': ['A/Z', 'B'],
> })
> ds.write_dataset(
>     data=pa.Table.from_pandas(df),
>     base_dir='data',
>     format='parquet',
>     partitioning=['instrument_id'],
>     partitioning_flavor='hive',
> )
> table = ds.dataset(
>     source='data',
>     format='parquet',
>     partitioning='hive',
> ).to_table()
> tables = [table]
> df = pa.concat_tables(tables).to_pandas()
> print(df.head()){code}
> Result:
> {code:java}
>    value instrument_id
> 0      1             A
> 1      2             B {code}
> Expected behaviour:
> Option 1: Result should be:
> {code:java}
>    value instrument_id
> 0      1             A/Z
> 1      2             B {code}
> Option 2: Error should be raised to avoid '/' in partition value.
>  
>  
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
