[
https://issues.apache.org/jira/browse/ARROW-16045?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Damian Barabonkov updated ARROW-16045:
--------------------------------------
Description:
This issue is present in pyarrow v7.0.0 but not in v6.0.1.
Pyarrow raises an error when reading a parquet file with an "in" filter whose
value set is empty, if the filtered column is a string column. Interestingly,
the same empty filter on a float column works (in v7.0.0 as well).
The following Python code is a minimal example that reproduces the issue:
{code:python}
import pandas as pd
import numpy as np

path = './example_df.parquet'
df = pd.DataFrame(
    {
        "A": 1.0,
        "B": pd.Timestamp("20130102"),
        "C": pd.Series(1, index=list(range(4)), dtype="float32"),
        "D": np.array([3] * 4, dtype="int32"),
        "E": pd.Categorical(["test", "train", "test", "train"]),
        "F": "foo",
    }
)
df.to_parquet(path)

# Works!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("A", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1 and v7.0.0
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
print(df_read)

# Fails!
df_read = pd.read_parquet(
    path,
    filters=[
        [
            ("F", "in", set())
        ]
    ]
)
# Pyarrow v6.0.1
#
# Empty DataFrame
# Columns: [A, B, C, D, E, F]
# Index: []
# Pyarrow v7.0.0
#
# pyarrow.lib.ArrowInvalid: Array type didn't match type of values set: string vs null
print(df_read)
{code}
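Until the regression is fixed, one possible caller-side workaround is to detect an empty "in" set before the filter ever reaches pyarrow. The sketch below is only an illustration under stated assumptions: the helper names `conjunction_matches_nothing` and `filters_match_nothing` are hypothetical and not part of pandas or pyarrow, and the filters are assumed to be in the nested list-of-lists form used in the example above (an outer list of OR-ed groups, each an inner list of AND-ed `(column, op, value)` tuples).

```python
def conjunction_matches_nothing(conjunction):
    """Return True if an AND-ed group contains an "in" predicate with no values.

    Such a predicate can never be satisfied, so the whole group matches no rows.
    """
    return any(
        op == "in" and len(values) == 0
        for _column, op, values in conjunction
    )


def filters_match_nothing(filters):
    """Return True if every OR-ed group in the filters is unsatisfiable.

    Assumes the nested form [[(column, op, value), ...], ...] (hypothetical
    helper; not a pandas or pyarrow API).
    """
    return len(filters) > 0 and all(
        conjunction_matches_nothing(group) for group in filters
    )


# The failing filter from the report is detected without touching pyarrow.
failing_filters = [[("F", "in", set())]]
print(filters_match_nothing(failing_filters))
```

When the check returns True, the caller already knows the result is empty and could, for example, read the file without the filter and keep zero rows, rather than passing the empty set through. This avoids the error but does not address the underlying type inference for empty value sets.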
was:
Pyarrow errors when attempting to read from a parquet file with an empty filter
on a string column. This issue is present in pyarrow v7.0.0, but not in v6.0.1.
Also, interestingly the issue is not present when reading from an integer
column (in v7.0.0 as well).
> Version=7.0.0 introduces bug when filtering by empty set during load
> --------------------------------------------------------------------
>
> Key: ARROW-16045
> URL: https://issues.apache.org/jira/browse/ARROW-16045
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0
> Environment: pandas 1.3.5
> pyarrow 7.0.0
> python 3.10.4
> Reporter: Damian Barabonkov
> Priority: Major
> Fix For: 6.0.1