Julian de Ruiter created ARROW-5666:
---------------------------------------

             Summary: [Python] Underscores in partition (string) values are 
dropped when reading dataset
                 Key: ARROW-5666
                 URL: https://issues.apache.org/jira/browse/ARROW-5666
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.13.0
            Reporter: Julian de Ruiter


When reading a partitioned dataset whose partition column contains string 
values with underscores, pyarrow appears to drop the underscores and return 
the resulting values as integers.

For example, if I write and then read a dataset as follows:
{code:python}
import pyarrow as pa
import pyarrow.parquet as pq  # needed for write_to_dataset / ParquetDataset
import pandas as pd

df = pd.DataFrame({
    "year_week": ["2019_2", "2019_3"],
    "value": [1, 2]
})

table = pa.Table.from_pandas(df.head())
pq.write_to_dataset(table, 'test', partition_cols=["year_week"])

table2 = pq.ParquetDataset('test').read()
{code}
The resulting 'year_week' column in table2 has lost the underscores:
{code:java}
table2[1] # Gives:

<Column name='year_week' type=DictionaryType(dictionary<values=int64, 
indices=int32, ordered=0>)>
[

  -- dictionary:
    [
      20192,
      20193
    ]
  -- indices:
    [
      0
    ],

  -- dictionary:
    [
      20192,
      20193
    ]
  -- indices:
    [
      1
    ]
]
{code}
Is this intentional behaviour, or is this a bug in Arrow?
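
One possible explanation, which I have not verified against the pyarrow source: if the partition values are type-inferred by trying Python's built-in int() on each directory name, then "2019_2" would parse as 20192, because int() has accepted underscores as digit separators since Python 3.6 (PEP 515). The helper below is hypothetical and only illustrates that assumption; it is not pyarrow's actual code.

```python
def infer_partition_value(raw):
    """Hypothetical inference step: try int() first, else keep the string."""
    try:
        # int() accepts single underscores between digits (PEP 515),
        # so "2019_2" is parsed as the integer 20192.
        return int(raw)
    except ValueError:
        # Anything int() rejects stays a string.
        return raw


for key in ["2019_2", "2019_3", "2019-2"]:
    print(repr(key), "->", repr(infer_partition_value(key)))
```

If that is what happens, a value such as "2019-2", which int() rejects, should survive as a string, while the underscore form is silently turned into an int64.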



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
