[ https://issues.apache.org/jira/browse/ARROW-5666?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17060975#comment-17060975 ]
Joris Van den Bossche commented on ARROW-5666:
----------------------------------------------
This now works with the new Datasets API:
{code}
In [2]: import pyarrow.dataset as ds
In [3]: dataset = ds.dataset("test/", format="parquet", partitioning="hive")
In [4]: dataset.schema
Out[4]:
value: int64
year_week: string
In [5]: dataset.to_table().to_pandas()
Out[5]:
   value year_week
0      1    2019_2
1      2    2019_3
{code}
So once the parquet module starts using this new code (ARROW-8039), this
issue should be resolved.
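
For illustration only, here is a minimal sketch of how a hive-style partition directory such as "year_week=2019_2" can end up with the value 20192. It relies on one fact about Python itself: since Python 3.6, int() accepts underscores as digit separators, so int("2019_2") parses successfully. That the legacy reader infers partition value types in this way is an assumption, not something confirmed in this thread. The helpers parse_hive_segment, infer_naive and infer_strict are hypothetical names and are not part of pyarrow.

```python
from urllib.parse import unquote


def parse_hive_segment(segment):
    """Split a hive-style directory name like 'year_week=2019_2' into (key, value)."""
    key, _, value = segment.partition("=")
    return key, unquote(value)


def infer_naive(value):
    """Hypothetical inference: try int() first, fall back to the raw string.

    int() accepts '_' as a digit separator, so '2019_2' becomes 20192.
    """
    try:
        return int(value)
    except ValueError:
        return value


def infer_strict(value):
    """Hypothetical stricter inference: only plain ASCII digit strings become ints."""
    if value.isascii() and value.isdigit():
        return int(value)
    return value


key, raw = parse_hive_segment("year_week=2019_2")
print(key, repr(raw))           # year_week '2019_2'
print(infer_naive(raw))         # 20192
print(repr(infer_strict(raw)))  # '2019_2'
```

The strict variant is deliberately simple (it does not handle negative numbers, for example). It is only meant to show that keeping the value as a string whenever it is not a plain run of digits preserves the underscores.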
> [Python] Underscores in partition (string) values are dropped when reading
> dataset
> ----------------------------------------------------------------------------------
>
> Key: ARROW-5666
> URL: https://issues.apache.org/jira/browse/ARROW-5666
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.13.0
> Reporter: Julian de Ruiter
> Priority: Major
> Labels: dataset-parquet-read, parquet
>
> When reading a partitioned dataset, in which the partition column contains
> string values with underscores, pyarrow seems to be ignoring the underscores
> in the resulting values.
> For example if I write and then read a dataset as follows:
> {code:java}
> import pyarrow as pa
> import pyarrow.parquet as pq  # needed for write_to_dataset / ParquetDataset
> import pandas as pd
> df = pd.DataFrame({
>     "year_week": ["2019_2", "2019_3"],
>     "value": [1, 2]
> })
> table = pa.Table.from_pandas(df.head())
> pq.write_to_dataset(table, 'test', partition_cols=["year_week"])
> table2 = pq.ParquetDataset('test').read()
> {code}
> The resulting 'year_week' column in table2 has lost the underscores:
> {code:java}
> table2[1] # Gives:
> <Column name='year_week' type=DictionaryType(dictionary<values=int64,
> indices=int32, ordered=0>)>
> [
> -- dictionary:
> [
> 20192,
> 20193
> ]
> -- indices:
> [
> 0
> ],
> -- dictionary:
> [
> 20192,
> 20193
> ]
> -- indices:
> [
> 1
> ]
> ]
> {code}
> Is this intentional behaviour or is this a bug in arrow?