Matt Nizol created ARROW-13578:
----------------------------------
Summary: Inconsistent handling of integer-valued partitions in
dataset filters API
Key: ARROW-13578
URL: https://issues.apache.org/jira/browse/ARROW-13578
Project: Apache Arrow
Issue Type: Bug
Reporter: Matt Nizol
When creating a partitioned dataset via the pandas DataFrame.to_parquet() method,
partition columns appear to be cast to strings in the partition metadata. When
reading specific partitions via the filters parameter of pandas.read_parquet(),
string values must be used for filter operands _except when_ the partition
column has an integer value.
Consider the following example:
{code:python}
import datetime

import pandas as pd

df = pd.DataFrame({
    "key1": ['0', '1', '2'],
    "key2": [0, 1, 2],
    "key3": ['a', 'b', 'c'],
    "key4": [1.1, 2.2, 3.3],
    "key5": [True, False, True],
    "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3),
             datetime.date(2021, 6, 4)],
    "data": ["foo", "bar", "baz"]
})
df['key6'] = pd.to_datetime(df['key6'])
df.to_parquet('./test.parquet',
              partition_cols=['key1', 'key2', 'key3', 'key4', 'key5', 'key6'])
{code}
Reading into a ParquetDataset and inspecting the partition levels suggests that
the partition keys have been cast to string, regardless of the original type:
{code:python}
import pyarrow.parquet as pq

ds = pq.ParquetDataset('./test.parquet')
for level in ds.partitions.levels:
    print(f"{level.name}: {level.keys}")
{code}
Output:
{noformat}
key1: ['0', '1', '2']
key2: ['0', '1', '2']
key3: ['a', 'b', 'c']
key4: ['1.1', '2.2', '3.3']
key5: ['True', 'False']
key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04 00:00:00']
{noformat}
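For what it's worth, the string-valued keys above come from the legacy
ParquetDataset interface; the newer pyarrow.dataset API instead infers partition
types from the hive directory names, which may explain why integer-looking
values end up typed as int32. A minimal sketch (assuming the default hive
partitioning discovery, reduced to a single partition column):

{code:python}
# Sketch: inspect the types the dataset API infers for hive partitions.
import os
import tempfile

import pandas as pd
import pyarrow.dataset as pds

path = os.path.join(tempfile.mkdtemp(), 'test.parquet')
pd.DataFrame({
    "key1": ['0', '1', '2'],   # written as strings...
    "data": ["foo", "bar", "baz"],
}).to_parquet(path, partition_cols=['key1'])

dataset = pds.dataset(path, partitioning='hive')
# ...but inferred back as an integer type from the directory names
print(dataset.schema.field('key1').type)
{code}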
Filtering the dataset using any of the non-integer partition keys along with
string-valued operands works as expected:
{code:python}
df2 = pd.read_parquet('./test.parquet',
                      filters=[('key4', '=', '1.1'), ('key5', '=', 'True')])
df2.head()
{code}
Output:
{noformat}
data key1 key2 key3 key4 key5 key6
0 foo 0 0 a 1.1 True 2021-06-02 00:00:00
{noformat}
However, filtering the dataset using either of the integer-valued partition
keys with a string-valued operand raises an exception, *even when the original
column's data type is string*:
{code:python}
df2 = pd.read_parquet('./test.parquet', filters=[('key1', '=', '1')])
df2.head()
{code}
{noformat}
ArrowNotImplementedError: Function equal has no kernel matching input types
(array[int32], scalar[string])
{noformat}
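As a possible workaround (an observation on my part, not a documented contract):
passing the operand with the inferred integer type, rather than a string, seems
to avoid the kernel mismatch. A sketch, again reduced to a single partition
column:

{code:python}
# Workaround sketch: use an integer operand for integer-valued partition keys.
import os
import tempfile

import pandas as pd

path = os.path.join(tempfile.mkdtemp(), 'test.parquet')
pd.DataFrame({
    "key1": ['0', '1', '2'],
    "data": ["foo", "bar", "baz"],
}).to_parquet(path, partition_cols=['key1'])

# An integer operand matches the int32 type inferred from the directory
# names, so the "equal has no kernel" error is not raised.
df2 = pd.read_parquet(path, filters=[('key1', '=', 1)])
print(df2['data'].tolist())
{code}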
It would be less surprising, and more consistent, if filter operands were
either (a) always cast to string or (b) always compared using the original
column type.
Note, this issue may be related to ARROW-12114.