Matt Nizol created ARROW-13578:
----------------------------------

             Summary: Inconsistent handling of integer-valued partitions in dataset filters API
                 Key: ARROW-13578
                 URL: https://issues.apache.org/jira/browse/ARROW-13578
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Matt Nizol


When creating a partitioned dataset via the pandas DataFrame.to_parquet() 
method, partition columns appear to be cast to strings in the partition 
metadata.  When reading specific partitions via the filters parameter of 
pandas.read_parquet(), string values must be used for filter operands 
_except when_ the partition column has an integer value.

Consider the following example:
{code:python}
import datetime
import pandas as pd

df = pd.DataFrame({
    "key1": ['0', '1', '2'], 
    "key2": [0, 1, 2],
    "key3": ['a', 'b', 'c'],
    "key4": [1.1, 2.2, 3.3],
    "key5": [True, False, True],
    "key6": [datetime.date(2021, 6, 2), datetime.date(2021, 6, 3),
             datetime.date(2021, 6, 4)],
    "data": ["foo", "bar", "baz"]
})

df['key6'] = pd.to_datetime(df['key6'])

df.to_parquet('./test.parquet',
              partition_cols=['key1', 'key2', 'key3', 'key4', 'key5', 'key6'])
{code}
Reading into a ParquetDataset and inspecting the partition levels suggests that 
partition keys have been cast to string, regardless of the original type:
{code:python}
import pyarrow.parquet as pq

ds = pq.ParquetDataset('./test.parquet')
for level in ds.partitions.levels:
    print(f"{level.name}: {level.keys}")
{code}

Output:
{noformat}
key1: ['0', '1', '2']
key2: ['0', '1', '2']
key3: ['a', 'b', 'c']
key4: ['1.1', '2.2', '3.3']
key5: ['True', 'False']
key6: ['2021-06-02 00:00:00', '2021-06-03 00:00:00', '2021-06-04 00:00:00']
{noformat}

Filtering the dataset using any of the non-integer partition keys along with 
string-valued operands works as expected:

{code:python}
df2 = pd.read_parquet('./test.parquet',
                      filters=[('key4', '=', '1.1'), ('key5', '=', 'True')])
df2.head()
{code}

Output:
{noformat}
        data    key1    key2    key3    key4    key5    key6
0       foo     0       0       a       1.1     True    2021-06-02 00:00:00
{noformat}

However, filtering the dataset using either of the integer-valued partition 
keys with a string-valued operand raises an exception, *even when the original 
column's data type is string*:

{code:python}
df2=pd.read_parquet('./test.parquet', filters=[('key1','=','1')])
df2.head()
{code}

{noformat}
ArrowNotImplementedError: Function equal has no kernel matching input types (array[int32], scalar[string])
{noformat}
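
The int32 in the error above suggests that the partition column's type is inferred from the directory names at read time, not carried over from the original column. The sketch below is a simplified illustration of such a rule, under the assumption that a level is typed as integer only when every key parses as an integer. The function name infer_partition_type is hypothetical and this is not pyarrow's actual implementation.

```python
def infer_partition_type(raw_values):
    """Hypothetical rule: 'int32' if every key parses as an integer, else 'string'."""
    try:
        for value in raw_values:
            int(value)
    except ValueError:
        return "string"
    return "int32"


# Partition keys as printed by ds.partitions.levels in the report above.
levels = {
    "key1": ["0", "1", "2"],        # original column was string
    "key2": ["0", "1", "2"],        # original column was integer
    "key3": ["a", "b", "c"],
    "key4": ["1.1", "2.2", "3.3"],
    "key5": ["True", "False"],
}

inferred = {name: infer_partition_type(keys) for name, keys in levels.items()}
for name, type_name in inferred.items():
    print(f"{name}: {type_name}")
```

Under this assumed rule, key1 and key2 are indistinguishable once written to disk, which would explain why key1 is compared as int32 even though the original column held strings.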

It would seem to be less surprising / more consistent if filter operands either 
(a) are always cast to string, or (b) always retain their original type.
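
Until the behavior is consistent, a caller-side workaround might be to coerce operands before building the filters. The following is a minimal sketch, assuming the caller already knows which partition columns are read back as integers. The helper coerce_filter_operands and its int_columns argument are hypothetical and not part of pandas or pyarrow.

```python
def coerce_filter_operands(filters, int_columns):
    """Return a copy of `filters` with operands for `int_columns` cast to int.

    `filters` is a list of (column, op, operand) tuples, the same shape as in
    the read_parquet examples above. Operands for other columns are left
    untouched.
    """
    coerced = []
    for column, op, operand in filters:
        if column in int_columns:
            if isinstance(operand, (list, tuple, set)):
                # Collection operands, e.g. for the 'in' operator.
                operand = type(operand)(int(v) for v in operand)
            else:
                operand = int(operand)
        coerced.append((column, op, operand))
    return coerced


filters = [("key1", "=", "1"), ("key4", "=", "1.1")]
safe_filters = coerce_filter_operands(filters, int_columns={"key1", "key2"})
print(safe_filters)
```

The resulting list could then be passed as the filters argument in place of the original one. This only hides the inconsistency on the caller's side and does not address either option (a) or (b).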

Note, this issue may be related to ARROW-12114.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
