Nick Riasanovsky created ARROW-16495:
----------------------------------------
Summary: [Python] Scanner.count_rows() doesn't properly handle
null expressions
Key: ARROW-16495
URL: https://issues.apache.org/jira/browse/ARROW-16495
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 7.0.0
Reporter: Nick Riasanovsky
Passing a filter expression built with `is_null()` does not handle null values
correctly when computing row counts. I have reproduced this with both string and
integer columns. Here is a reproducer:
```python
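import pandas as pd
import pyarrow.dataset as ds

# Nullable integer column with two nulls and one value
df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
print(df)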
df.to_parquet("test.pq")
# Create a dataset
dataset = ds.dataset("test.pq")
fragments = [f for f in dataset.get_fragments()]
# There should just be 1 fragment.
fragment = fragments[0]
# Get the null row count
expr = ds.field("C").is_null()
scanner = fragment.scanner(filter=expr)
print(scanner.count_rows())
```
I expect this to print 2, since there are 2 NULL values.
--
This message was sent by Atlassian Jira
(v8.20.7#820007)