[
https://issues.apache.org/jira/browse/ARROW-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491423#comment-17491423
]
David Li edited comment on ARROW-15312 at 2/12/22, 6:18 PM:
------------------------------------------------------------
I did some debugging of the predicate pushdown. With statistics enabled, the
row group has the guarantee {{(col_with_na_and_one_val == 0)}} and the
predicate being tested is {{is_null(col_with_na_and_one_val,
\{nan_is_null=true})}}. The way simplification works is that there's a pass
that extracts a map of field-value pairs from equality expressions in the
guarantee. Hence, the simplification pass concludes that
{{col_with_na_and_one_val}} can only be zero, i.e. it can't be null, so
{{is_null(col_with_na_and_one_val)}} gets simplified to {{false}} and the row
group is pruned. The main issue is that we aren't properly accounting for
nullability, either when generating guarantees from row group statistics or
during simplification.
(FWIW, this doesn't affect any other formats since we don't have predicate
pushdown for them in the first place.)
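To make the failure mode concrete, here is a toy Python model of the behavior described above. It is only a sketch, not Arrow's actual code: the function names and the {{maybe_null}} flag are made up for illustration, and {{None}} stands in for a null value. It assumes the guarantee is derived from min/max statistics alone, which is how an equality like {{(col == 0)}} can arise for a column holding one distinct value plus nulls.

```python
# Toy model of guarantee-based simplification; NOT Arrow's implementation.
# All names here are hypothetical. None stands in for a null value.

def stats_guarantee(values):
    """Buggy variant: derive {"eq": v} from min/max only, ignoring nulls."""
    non_null = [v for v in values if v is not None]
    if non_null and min(non_null) == max(non_null):
        return {"eq": non_null[0]}
    return {}

def stats_guarantee_null_aware(values):
    """Possible fix: also record whether the row group may contain nulls."""
    guarantee = stats_guarantee(values)
    guarantee["maybe_null"] = any(v is None for v in values)
    return guarantee

def simplify_is_null(guarantee):
    """Return False if is_null(col) is provably false, else None (unknown)."""
    if "eq" in guarantee and not guarantee.get("maybe_null", False):
        return False  # "col can only be <eq>", so the row group gets pruned
    return None       # cannot decide, so the row group must be scanned

y = [0, 0, None]  # one distinct value plus a null, as in the report
z = [0, 1, None]  # two distinct values plus a null

pruned_y = simplify_is_null(stats_guarantee(y)) is False
pruned_z = simplify_is_null(stats_guarantee(z)) is False
pruned_y_fixed = simplify_is_null(stats_guarantee_null_aware(y)) is False
```

In this model the row group is wrongly pruned for {{y}} but not for {{z}}, which matches the report; carrying nullability in the guarantee keeps the row group for {{y}} as well.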
> [R][C++] filtering a Parquet dataset with is.na() misses some rows
> ------------------------------------------------------------------
>
> Key: ARROW-15312
> URL: https://issues.apache.org/jira/browse/ARROW-15312
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 6.0.1
> Environment: R 4.1.2 on Windows
> arrow 6.0.1
> dplyr 1.0.7
> Reporter: Pierre Gramme
> Priority: Major
> Labels: dataset
> Fix For: 7.0.1, 8.0.0
>
>
> Hi!
> I just found an issue when querying an Arrow dataset with dplyr, filtering on
> is.na(...).
> It seems linked to columns containing only one distinct value and some NAs.
> Can you also reproduce the following?
>
> {code:java}
> library(arrow)
> library(dplyr)
>
> ds_path = "test-arrow-na"
> df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))
>
> df %>% arrow::write_dataset(ds_path)
>
> # OK: Collect then filter: returns row 3, as expected
> arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))
> # ERROR: Filter then collect (on y) returns a tibble with no rows
> arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()
>
> # OK: Filter then collect (on z) returns row 3, as expected
> arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()
> {code}
>
> Thanks
> Pierre