[
https://issues.apache.org/jira/browse/ARROW-16495?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17533027#comment-17533027
]
Nick Riasanovsky commented on ARROW-16495:
------------------------------------------
For additional context, `scanner.to_table()` does read the data correctly
with ~is_null, but not with is_null, so it's just a count_rows issue.
{code:python}
expr = ds.field("C").is_null()
scanner = fragment.scanner(filter=expr)
print(scanner.to_table()){code}
Outputs:
{code:java}
C: string
----
C: [[]]{code}
While
{code:python}
expr = ds.field("C").is_null()
scanner = fragment.scanner(filter=~expr)
print(scanner.to_table()){code}
Outputs:
{code:java}
pyarrow.Table
C: string
----
C: [["A"]]{code}
> [Python] Scanner.count_rows() doesn't properly handle null expressions
> ----------------------------------------------------------------------
>
> Key: ARROW-16495
> URL: https://issues.apache.org/jira/browse/ARROW-16495
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 7.0.0
> Reporter: Nick Riasanovsky
> Priority: Major
>
> Passing an expression filter with `is_null()` doesn't properly handle null
> values when computing row counts. I have reproduced this with both string
> and integer columns. Here is a reproducer.
>
>
>
> {code:python}
> import pandas as pd
> import pyarrow.dataset as ds
>
> df = pd.DataFrame({"C": pd.array([None, None, 1], dtype=pd.Int64Dtype())})
> print(df)
> df.to_parquet("test.pq")
>
> # Create a dataset
> dataset = ds.dataset("test.pq")
> fragments = list(dataset.get_fragments())
> # There should be just 1 fragment.
> fragment = fragments[0]
> # Get the null row count
> expr = ds.field("C").is_null()
> scanner = fragment.scanner(filter=expr)
> print(scanner.count_rows())
> {code}
>
>
> I expect this to print 2, as there are 2 NULL values.