[
https://issues.apache.org/jira/browse/ARROW-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Damian Barabonkov updated ARROW-16045:
--------------------------------------
Description:
This issue is present in pyarrow v7.0.0 but not in v6.0.1.
Pyarrow raises an error when reading a parquet file with an "in" filter whose
value set is empty, if the filtered column is a string or categorical column
(columns "F" and "E" in the example).
The error does not occur in v7.0.0 when the same filter is applied to a float,
timestamp or integer column ("A" through "D").
The following Python code is a minimal example that reproduces the issue:
{code:python}
import pandas as pd
import numpy as np

path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read)
{code}
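Until the regression is fixed, one possible caller-side workaround is to detect
the empty "in" filter before it reaches pyarrow. The sketch below is not part of
pyarrow or pandas: the helper names is_unsatisfiable and prune_empty_in_filters
are hypothetical, and it assumes the filters are given in the nested form used
in the example (an outer list of conjunctions, each a list of
(column, operator, value) tuples).

```python
from typing import Any, List, Tuple

Clause = Tuple[str, str, Any]   # (column, operator, value)
Conjunction = List[Clause]      # clauses joined by AND
DNF = List[Conjunction]         # conjunctions joined by OR


def is_unsatisfiable(conjunction: Conjunction) -> bool:
    """Return True if any clause is an "in" test against an empty collection.

    Such a clause can match no rows, so the whole AND group matches none.
    """
    return any(op == "in" and len(value) == 0 for _, op, value in conjunction)


def prune_empty_in_filters(filters: DNF) -> DNF:
    """Hypothetical helper: drop the conjunctions that can never match."""
    return [conj for conj in filters if not is_unsatisfiable(conj)]


filters = [[("F", "in", set())]]
remaining = prune_empty_in_filters(filters)
if not remaining:
    # Every OR branch is unsatisfiable, so the result is known to be empty
    # and the filtered read can be skipped entirely.
    print("filter matches no rows; skip the filtered read")
```

If nothing remains after pruning, the caller should build the empty result
itself rather than pass the empty list on as a filter, since an empty or missing
filter would normally select every row.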
was:
This issue is present in pyarrow v7.0.0, but not in v6.0.1.
Pyarrow errors when attempting to read from a parquet file with an empty filter
on a string column. Also, interestingly the issue is not present when reading
from a float column (in v7.0.0 as well).
The following Python code presents a minimal example which reproduces the issue:
{code:python}
import pandas as pd
import numpy as np

path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read)
{code}
> Version=7.0.0 introduces bug when filtering by empty set during load
> --------------------------------------------------------------------
>
> Key: ARROW-16045
> URL: https://issues.apache.org/jira/browse/ARROW-16045
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0
> Environment: pandas 1.3.5
> pyarrow 7.0.0
> python 3.10.4
> Reporter: Damian Barabonkov
> Priority: Major
> Fix For: 6.0.1