[jira] [Updated] (ARROW-16045) Version=7.0.0 introduces bug when filtering by empty set during load

Damian Barabonkov (Jira) Mon, 28 Mar 2022 06:15:04 -0700


     [ https://issues.apache.org/jira/browse/ARROW-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Damian Barabonkov updated ARROW-16045:
--------------------------------------
    Description: 
This issue is present in pyarrow v7.0.0, but not in v6.0.1.

Pyarrow raises an error when reading a parquet file with an "in" filter on a 
string column whose set of values is empty. Interestingly, the same filter 
on a float column works (in v7.0.0 as well).

 

The following Python code presents a minimal example which reproduces the issue:
{code:python}
import pandas as pd
import numpy as np
path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)

# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []

# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read) {code}

  was:
Pyarrow errors when attempting to read from a parquet file with an empty filter 
on a string column. This issue is present in pyarrow v7.0.0, but not in v6.0.1. 
Also, interestingly the issue is not present when reading from an integer 
column (in v7.0.0 as well).

 

The following Python code presents a minimal example which reproduces the issue:
{code:python}
import pandas as pd
import numpy as np
path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)

# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []

# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read) {code}


> Version=7.0.0 introduces bug when filtering by empty set during load
> --------------------------------------------------------------------
>
>                 Key: ARROW-16045
>                 URL: https://issues.apache.org/jira/browse/ARROW-16045
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 7.0.0
>         Environment: pandas                    1.3.5
> pyarrow                   7.0.0
> python                    3.10.4
>            Reporter: Damian Barabonkov
>            Priority: Major
>             Fix For: 6.0.1
>
>
> This issue is present in pyarrow v7.0.0, but not in v6.0.1.
> Pyarrow raises an error when reading a parquet file with an "in" filter on a 
> string column whose set of values is empty. Interestingly, the same filter 
> on a float column works (in v7.0.0 as well).
>  
> The following Python code presents a minimal example which reproduces the 
> issue:
> {code:python}
> import pandas as pd
> import numpy as np
> path = './example_df.parquet'
> df = pd.DataFrame(
>     {
>         "A": 1.0,
>         "B": pd.Timestamp("20130102"),
>         "C": pd.Series(1, index=list(range(4)), dtype="float32"),
>         "D": np.array([3] * 4, dtype="int32"),
>         "E": pd.Categorical(["test", "train", "test", "train"]),
>         "F": "foo",
>     }
> )
> df.to_parquet(path)
> # Works!
> df_read = pd.read_parquet(
>     path,
>     filters=[
>         [
>             ("A", "in", set())
>         ]
>     ]
> )
> # Pyarrow v6.0.1 and v7.0.0
> #
> # Empty DataFrame
> # Columns: [A, B, C, D, E, F]
> # Index: []
> print(df_read)
> # Fails!
> df_read = pd.read_parquet(
>     path,
>     filters=[
>         [
>             ("F", "in", set())
>         ]
>     ]
> )
> # Pyarrow v6.0.1
> #
> # Empty DataFrame
> # Columns: [A, B, C, D, E, F]
> # Index: []
> # Pyarrow v7.0.0
> #
> # pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
> print(df_read) {code}
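
Until the regression is addressed, one possible caller-side guard is to detect an "in" filter with an empty collection before calling pd.read_parquet, since such a predicate can never match any row. The sketch below is plain Python and does not touch pyarrow. The names prune_empty_in_filters and _is_empty_in are hypothetical helpers, not part of pandas or pyarrow. It assumes the filters follow the list-of-lists (OR of AND-groups) form used in the report, and it only looks at the "in" operator.

```python
def _is_empty_in(predicate):
    # A predicate is a (column, op, value) tuple, as in the report.
    column, op, value = predicate
    return op == "in" and len(value) == 0


def prune_empty_in_filters(filters):
    """Drop AND-groups that can never match.

    Hypothetical helper: returns (remaining_groups, always_empty), where
    always_empty is True when every group contained an empty "in" predicate.
    """
    # Accept the flat form [(col, op, val), ...] as a single AND-group.
    if filters and isinstance(filters[0], tuple):
        filters = [filters]
    remaining = [
        group for group in filters
        if not any(_is_empty_in(p) for p in group)
    ]
    always_empty = bool(filters) and not remaining
    return remaining, always_empty


# The failing filter from the report: nothing can match, so the read
# with filters can be skipped entirely.
remaining, always_empty = prune_empty_in_filters([[("F", "in", set())]])
print(remaining, always_empty)
```

When always_empty is True, the caller would skip the filtered read and build an empty frame with the file's columns instead. Otherwise the remaining groups can be passed as filters unchanged.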



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
