Damian Barabonkov created ARROW-16045:
-----------------------------------------

             Summary: Version=7.0.0 introduces bug when filtering by empty set 
during load
                 Key: ARROW-16045
                 URL: https://issues.apache.org/jira/browse/ARROW-16045
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 7.0.0
         Environment: pandas                    1.3.5
pyarrow                   7.0.0
python                    3.10.4

            Reporter: Damian Barabonkov
             Fix For: 6.0.1


Pyarrow raises an error when reading a parquet file with an "in" filter whose 
value set is empty, if the filtered column is a string column. The issue is 
present in pyarrow v7.0.0 but not in v6.0.1. Interestingly, the issue does not 
occur when filtering on a numeric column, even in v7.0.0.

 

The following Python code presents a minimal example which reproduces the issue:
{code:python}
import pandas as pd
import numpy as np
path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)

# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []

# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read)
{code}
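
Until a fix is available, one possible caller-side workaround is to detect filters that can never match before handing them to the reader. The sketch below is only an illustration, not part of pyarrow or pandas: the helper name matches_nothing is hypothetical, and it assumes the filters use the list-of-tuples (DNF) form shown in the example above.

```python
def matches_nothing(filters):
    """Return True if a DNF filter list can never match any row.

    Hypothetical helper, not a pyarrow or pandas API. It only recognises
    the case from this report: an "in" predicate with an empty value set.
    """
    if not filters:
        # No filter at all means every row matches.
        return False

    # A flat list of tuples is a single conjunction; wrap it so both
    # accepted shapes are handled as an OR of ANDs.
    if isinstance(filters[0], tuple):
        conjunctions = [filters]
    else:
        conjunctions = filters

    def is_empty_in(predicate):
        _column, op, value = predicate
        return op == "in" and len(value) == 0

    # The whole filter matches nothing only if every OR branch contains
    # at least one predicate that can never be satisfied.
    return all(
        any(is_empty_in(p) for p in conjunction)
        for conjunction in conjunctions
    )
```

A caller could check matches_nothing(filters) first and, when it returns True, skip the filtered read and construct an empty result with the file's columns instead.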



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
