[
https://issues.apache.org/jira/browse/ARROW-18269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17632152#comment-17632152
]
Vibhatha Lakmal Abeykoon commented on ARROW-18269:
--------------------------------------------------
[~westonpace]
For context: the values of the partition column are used to build the output
directory path. When a value contains a '/', that character is implicitly
treated as a path separator when the directory path is formed. Thus `A/Z`
creates a folder `A` with a folder `Z` inside it. I am not sure we can remove
that step or make the code ignore the separator.
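The effect described above can be reproduced in a few lines of plain Python.
This is only an illustrative sketch, not the Arrow C++ code: `hive_segment`
and `parse_hive_path` are hypothetical helpers, and `part-0.parquet` is just a
placeholder file name.

```python
import posixpath


def hive_segment(key, value):
    # Hypothetical helper: build a hive-style "key=value" directory segment
    # without escaping anything in the value.
    return f"{key}={value}"


def parse_hive_path(path):
    # Hypothetical helper: split the path on '/' and read every segment of
    # the form "key=value" as a partition field.
    fields = {}
    for segment in path.split("/"):
        if "=" in segment:
            key, _, value = segment.partition("=")
            fields[key] = value
    return fields


path = posixpath.join("data", hive_segment("instrument_id", "A/Z"), "part-0.parquet")
print(path)                   # data/instrument_id=A/Z/part-0.parquet
print(parse_hive_path(path))  # {'instrument_id': 'A'}
```

The `Z` ends up as its own directory level with no `key=` prefix, so a parser
that reads one segment per partition field sees only `A`.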
On the read side, however, when we recreate the fragments we could decide
whether to treat the directory part as a path or as a single value. If we
treat it as a path (which is what happens at the moment), we get the erroneous
output; if we treat it as a single value rather than a path, we can recover
the original value accurately.
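The two readings can be contrasted in a small Python sketch. It is an
assumption-laden illustration, not existing Arrow behaviour: the function
`decode_partition` and its flag `treat_slash_as_separator` are hypothetical,
and the sketch only covers a single partition key.

```python
def decode_partition(relative_dir, key, treat_slash_as_separator=True):
    # Hypothetical decoder for a directory such as "instrument_id=A/Z",
    # given relative to the dataset base directory.
    prefix = f"{key}="
    if not relative_dir.startswith(prefix):
        raise ValueError(f"expected a directory starting with {prefix!r}")
    raw = relative_dir[len(prefix):]
    if treat_slash_as_separator:
        # Path reading: only the first directory level is the value.
        return raw.split("/", 1)[0]
    # Single-value reading: keep everything after "key=" as the value.
    return raw


print(decode_partition("instrument_id=A/Z", "instrument_id"))
# A
print(decode_partition("instrument_id=A/Z", "instrument_id",
                       treat_slash_as_separator=False))
# A/Z
```

With more than one partition key the single-value reading becomes ambiguous,
because a nested `key=value` directory and a '/' inside a value look the same
on disk.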
This is one viable option. If we go this way, we could provide a lambda or a
flag to control this behavior.
I think a function that determines how the key is decoded from the file path
would be better.
Is this overly complicated, or not generic enough?
That said, I am inclined towards option 1 rather than option 2. Option 2 is
pretty straightforward to implement, but a case like the one above could be
very common.
How is the URL encoding/decoding part relevant here? Am I missing something?
Could you please clarify a bit?
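For reference, one way URL encoding could apply is sketched below in plain
Python using only the standard library. This is an assumption about what is
meant, not a description of the Arrow implementation, and the two helper
names are hypothetical: the value is percent-encoded before it becomes a
directory name and decoded again when the path is parsed.

```python
from urllib.parse import quote, unquote


def encode_partition_value(value):
    # Hypothetical helper: safe="" makes quote() escape '/' as well,
    # so the encoded value can no longer introduce a directory level.
    return quote(value, safe="")


def decode_partition_value(encoded):
    # Hypothetical helper: reverse the percent-encoding on read.
    return unquote(encoded)


encoded = encode_partition_value("A/Z")
print(f"instrument_id={encoded}")       # instrument_id=A%2FZ
print(decode_partition_value(encoded))  # A/Z
```

Under this reading the directory layout stays one level per partition field,
and the '/' survives the round trip.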
> [C++] Slash character in partition value handling
> -------------------------------------------------
>
> Key: ARROW-18269
> URL: https://issues.apache.org/jira/browse/ARROW-18269
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 10.0.0
> Reporter: Vadym Dytyniak
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
> Labels: good-first-issue
>
>
> The provided example shows that pyarrow does not correctly handle a
> partition value that contains '/':
> {code:java}
> import pandas as pd
> import pyarrow as pa
> from pyarrow import dataset as ds
> df = pd.DataFrame({
>     'value': [1, 2],
>     'instrument_id': ['A/Z', 'B'],
> })
> ds.write_dataset(
>     data=pa.Table.from_pandas(df),
>     base_dir='data',
>     format='parquet',
>     partitioning=['instrument_id'],
>     partitioning_flavor='hive',
> )
> table = ds.dataset(
>     source='data',
>     format='parquet',
>     partitioning='hive',
> ).to_table()
> tables = [table]
> df = pa.concat_tables(tables).to_pandas()
> print(df.head()){code}
> Result:
> {code:java}
>    value instrument_id
> 0      1             A
> 1      2             B {code}
> Expected behaviour:
> Option 1: Result should be:
> {code:java}
>    value instrument_id
> 0      1           A/Z
> 1      2             B {code}
> Option 2: Error should be raised to avoid '/' in partition value.
>
>
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)