[
https://issues.apache.org/jira/browse/ARROW-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491423#comment-17491423
]
David Li commented on ARROW-15312:
----------------------------------
I did some debugging for the predicate pushdown. With statistics enabled, the
row group has the guarantee {{(col_with_na_and_one_val == 0)}} and the
predicate being tested is {{is_null(col_with_na_and_one_val,
\{nan_is_null=true})}}. Simplification includes a pass that extracts a map of
field-value pairs from equality expressions. Hence, the simplification pass
concludes that {{col_with_na_and_one_val}} can only be zero, i.e. it can't be
null, so {{is_null(col_with_na_and_one_val)}} gets simplified to {{false}} and
the row group is pruned. The main issue, then, is that we aren't properly
accounting for nullability, either when generating guarantees from row group
statistics or during simplification.
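
To make the failure mode concrete, here is a toy Python model of the logic described above. It is only a sketch and not Arrow's actual code: {{RowGroupStats}}, {{guarantee_ignoring_nulls}}, {{guarantee_null_aware}}, {{simplify_is_null}} and {{row_group_is_pruned}} are hypothetical names invented for illustration, and the null-aware variant is just one plausible way to avoid the problem, not necessarily the fix that was applied.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for per-column row group statistics."""
    min_value: int
    max_value: int
    null_count: int


def guarantee_ignoring_nulls(field: str, stats: RowGroupStats) -> Dict[str, int]:
    # Models the problematic behaviour: min == max becomes "field == value",
    # even though the column may also contain nulls.
    if stats.min_value == stats.max_value:
        return {field: stats.min_value}
    return {}


def guarantee_null_aware(field: str, stats: RowGroupStats) -> Dict[str, int]:
    # Assumed alternative: only claim equality when there are no nulls.
    if stats.null_count == 0 and stats.min_value == stats.max_value:
        return {field: stats.min_value}
    return {}


def simplify_is_null(field: str, known_values: Dict[str, int]) -> Optional[bool]:
    # Models the simplification pass: a field with a known value is assumed
    # to be non-null, so is_null(field) folds to False. None means "unknown".
    if field in known_values:
        return False
    return None


def row_group_is_pruned(
    field: str,
    stats: RowGroupStats,
    make_guarantee: Callable[[str, RowGroupStats], Dict[str, int]],
) -> bool:
    # A row group is skipped when the filter is proven to be False for it.
    return simplify_is_null(field, make_guarantee(field, stats)) is False


# Statistics matching the reproducer: y = (0, 0, NA), z = (0, 1, NA).
y_stats = RowGroupStats(min_value=0, max_value=0, null_count=1)
z_stats = RowGroupStats(min_value=0, max_value=1, null_count=1)

print(row_group_is_pruned("y", y_stats, guarantee_ignoring_nulls))  # pruned: row 3 is lost
print(row_group_is_pruned("z", z_stats, guarantee_ignoring_nulls))  # kept: min != max
print(row_group_is_pruned("y", y_stats, guarantee_null_aware))      # kept: nulls are accounted for
```

In this model, the column with a single distinct value plus a null is the only one that gets pruned, which matches the observation that filtering on {{is.na(y)}} loses a row while {{is.na(z)}} does not.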
> [R][C++] filtering a Parquet dataset with is.na() misses some rows
> ------------------------------------------------------------------
>
> Key: ARROW-15312
> URL: https://issues.apache.org/jira/browse/ARROW-15312
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 6.0.1
> Environment: R 4.1.2 on Windows
> arrow 6.0.1
> dplyr 1.0.7
> Reporter: Pierre Gramme
> Priority: Major
> Fix For: 7.0.1, 8.0.0
>
>
> Hi!
> I just found an issue when querying an Arrow dataset with dplyr and filtering
> on is.na(...).
> It seems linked to columns containing only one distinct value and some NAs.
> Can you also reproduce the following?
>
> {code:java}
> library(arrow)
> library(dplyr)
>
> ds_path = "test-arrow-na"
> df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))
>
> df %>% arrow::write_dataset(ds_path)
>
> # OK: Collect then filter: returns row 3, as expected
> arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))
> # ERROR: Filter then collect (on y) returns a tibble with no row
> arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()
>
> # OK: Filter then collect (on z) returns row 3, as expected
> arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()
> {code}
>
> Thanks
> Pierre