[jira] [Resolved] (ARROW-5666) [Python] Underscores in partition (string) values are dropped when reading dataset

Wes McKinney (Jira) Mon, 04 May 2020 16:44:09 -0700


     [ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Wes McKinney resolved ARROW-5666.
---------------------------------
    Resolution: Fixed

Test added in 
https://github.com/apache/arrow/commit/57b50823d6d35a8169dc2f92ae68448a293a89e9

> [Python] Underscores in partition (string) values are dropped when reading 
> dataset
> ----------------------------------------------------------------------------------
>
>                 Key: ARROW-5666
>                 URL: https://issues.apache.org/jira/browse/ARROW-5666
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.13.0
>            Reporter: Julian de Ruiter
>            Assignee: Joris Van den Bossche
>            Priority: Major
>              Labels: dataset-parquet-read, parquet
>             Fix For: 1.0.0
>
>
> When reading a partitioned dataset whose partition column contains string 
> values with underscores, pyarrow drops the underscores from the resulting 
> values.
> For example, if I write and then read a dataset as follows:
> {code:java}
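> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
>
> df = pd.DataFrame({
>     "year_week": ["2019_2", "2019_3"],
>     "value": [1, 2]
> })
> table = pa.Table.from_pandas(df.head())
> # Write one directory per distinct "year_week" value (year_week=2019_2, ...)
> pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
> # Read back; partition values are reconstructed from the directory names
> table2 = pq.ParquetDataset('test').read()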
> {code}
> The resulting 'year_week' column in table2 has lost the underscores:
> {code:java}
> table2[1] # Gives:
> <Column name='year_week' type=DictionaryType(dictionary<values=int64, indices=int32, ordered=0>)>
> [
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       0
>     ],
>   -- dictionary:
>     [
>       20192,
>       20193
>     ]
>   -- indices:
>     [
>       1
>     ]
> ]
> {code}
> Is this intentional behaviour or is this a bug in arrow?
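
A possible explanation for the int64 values in the output above, offered as an assumption and not something stated in this thread: since Python 3.6 (PEP 515), the built-in int() accepts single underscores between digits, so int("2019_2") succeeds and returns 20192. Any type inference that tries int() first and falls back to strings only on ValueError would therefore turn these partition keys into integers. The sketch below illustrates that effect in plain Python. The helpers naive_infer and strict_infer are hypothetical names made up for this illustration; they are not pyarrow functions and are not taken from the fix.

```python
from typing import List, Union

# Python's int() accepts single underscores between digits (PEP 515),
# so "2019_2" parses as an integer instead of raising ValueError.
parsed = int("2019_2")  # 20192


def naive_infer(keys: List[str]) -> List[Union[int, str]]:
    """Hypothetical helper: try int() on every key, fall back to strings."""
    try:
        return [int(k) for k in keys]
    except ValueError:
        return list(keys)


def strict_infer(keys: List[str]) -> List[Union[int, str]]:
    """Hypothetical helper: treat keys as integers only if every key is
    plain ASCII digits with an optional leading minus sign."""
    def is_plain_int(k: str) -> bool:
        body = k[1:] if k.startswith("-") else k
        return body.isascii() and body.isdigit()

    if all(is_plain_int(k) for k in keys):
        return [int(k) for k in keys]
    return list(keys)


keys = ["2019_2", "2019_3"]
print(naive_infer(keys))   # [20192, 20193]
print(strict_infer(keys))  # ['2019_2', '2019_3']
```

With the stricter check, keys such as "2019_2" stay strings, while keys such as "1" and "2" are still converted to integers.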



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
