[ 
https://issues.apache.org/jira/browse/ARROW-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damian Barabonkov updated ARROW-16045:
--------------------------------------
    Description: 
This issue is present in pyarrow v7.0.0, but not in v6.0.1.

pyarrow raises an error when reading a Parquet file with an "in" filter whose 
value set is empty, if the filter targets a categorical or string column 
(columns "E" and "F" in the example). The issue does not occur in v7.0.0 when 
the filter targets a float, timestamp, or integer column ("A" through "D").

The following Python code is a minimal example that reproduces the issue:
{code:python}
import pandas as pd
import numpy as np
path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)

# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []

# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read)
{code}

  was:
This issue is present in pyarrow v7.0.0, but not in v6.0.1.

Pyarrow errors when attempting to read from a parquet file with an empty filter 
on a string column. Also, interestingly the issue is not present when reading 
from a float column (in v7.0.0 as well).

 

The following Python code presents a minimal example which reproduces the issue:
{code:python}
import pandas as pd
import numpy as np
path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)

# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []

# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read)
{code}


> Version=7.0.0 introduces bug when filtering by empty set during load
> --------------------------------------------------------------------
>
>                 Key: ARROW-16045
>                 URL: https://issues.apache.org/jira/browse/ARROW-16045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>         Environment: pandas                    1.3.5
> pyarrow                   7.0.0
> python                    3.10.4
>            Reporter: Damian Barabonkov
>            Priority: Major
>             Fix For: 6.0.1
>
>
> This issue is present in pyarrow v7.0.0, but not in v6.0.1.
> pyarrow raises an error when reading a Parquet file with an "in" filter 
> whose value set is empty, if the filter targets a categorical or string 
> column (columns "E" and "F" in the example). The issue does not occur in 
> v7.0.0 when the filter targets a float, timestamp, or integer column ("A" 
> through "D").
>
> The following Python code is a minimal example that reproduces the issue:
> {code:python}
> import pandas as pd
> import numpy as np
> path = './example_df.parquet'
> df = pd.DataFrame(
>     {
>         "A": 1.0,
>         "B": pd.Timestamp("20130102"),
>         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
>         "D": np.array([3] * 4, dtype="int32"),
>         "E": pd.Categorical(["test", "train", "test", "train"]),
>         "F": "foo",
>     }
> )
> df.to_parquet(path)
> # Works!
> df_read = pd.read_parquet(
>     path,
>     filters=[
>         [
>             ("A", "in", set())
>         ]
>     ]
> )
> # Pyarrow v6.0.1 and v7.0.0
> #
> # Empty DataFrame
> # Columns: [A, B, C, D, E, F]
> # Index: []
> print(df_read)
> # Fails!
> df_read = pd.read_parquet(
>     path,
>     filters=[
>         [
>             ("F", "in", set())
>         ]
>     ]
> )
> # Pyarrow v6.0.1
> #
> # Empty DataFrame
> # Columns: [A, B, C, D, E, F]
> # Index: []
> # Pyarrow v7.0.0
> #
> # pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
> print(df_read)
> {code}
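
A possible caller-side workaround, sketched here and not taken from the issue: an "in" predicate with an empty value set can never match a row, so any conjunction containing one can be dropped before the filters reach pd.read_parquet. The helper name prune_empty_in is hypothetical, and the sketch assumes the filters use the list-of-tuples (DNF) form shown in the example.

```python
def prune_empty_in(filters):
    """Drop conjunctions that contain an "in" predicate with an empty value set.

    Hypothetical helper, not part of pandas or pyarrow.
    Returns (remaining_conjunctions, matches_nothing).
    """
    if not filters:
        # No filters at all means every row matches.
        return filters, False

    # Accept both a flat list of tuples (one AND group) and a list of lists
    # (an OR of AND groups).
    if isinstance(filters[0], tuple):
        conjunctions = [list(filters)]
    else:
        conjunctions = [list(conj) for conj in filters]

    def never_matches(pred):
        _column, op, value = pred
        return op == "in" and len(value) == 0

    # A conjunction with one unsatisfiable predicate is unsatisfiable itself.
    remaining = [
        conj for conj in conjunctions
        if not any(never_matches(pred) for pred in conj)
    ]
    # If every conjunction was dropped, the whole filter matches no rows.
    return remaining, not remaining
```

When the second return value is True, the filtered read can be skipped entirely. For a small file, one option is pd.read_parquet(path).iloc[0:0], which yields an empty DataFrame with the original columns at the cost of reading the whole file.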



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
