[jira] [Comment Edited] (ARROW-15312) [R][C++] filtering a Parquet dataset with is.na() misses some rows

David Li (Jira) Sat, 12 Feb 2022 10:19:05 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491423#comment-17491423
 ]


David Li edited comment on ARROW-15312 at 2/12/22, 6:18 PM:
------------------------------------------------------------

I did some debugging of the predicate pushdown. With statistics enabled, the 
row group has the guarantee {{(col_with_na_and_one_val == 0)}} and the 
predicate being tested is {{is_null(col_with_na_and_one_val, 
\{nan_is_null=true})}}. Simplification includes a pass that extracts a map of 
field-value pairs from equality expressions. That pass therefore concludes that 
{{col_with_na_and_one_val}} can only be zero, i.e. it can't be null, so 
{{is_null(col_with_na_and_one_val)}} gets simplified to {{false}} and the row 
group is pruned. The main issue is that we aren't properly accounting for 
nullability, both when generating guarantees from row group statistics and 
during simplification.

(FWIW, this doesn't affect any other formats since we don't have predicate 
pushdown in the first place.)
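
As a rough illustration of the failure mode described above, here is a toy 
model in Python. It is not Arrow's C++ code: the tuple-based expressions and 
the names {{guarantee_from_stats}}, {{simplify}} and {{null_aware}} are all 
hypothetical. Adding {{or is_null(field)}} to the guarantee is only one 
possible way to account for nulls, assumed here for the sake of the sketch.

```python
# Toy expressions as tuples (hypothetical, not Arrow's Expression class):
#   ("eq", field, value), ("is_null", field), ("or", lhs, rhs)

def guarantee_from_stats(field, min_value, max_value, null_count, null_aware):
    """Build a toy guarantee from row-group statistics (min == max case only)."""
    if min_value != max_value:
        return None  # no equality guarantee in this toy model
    guarantee = ("eq", field, min_value)
    if null_aware and null_count > 0:
        # Assumed fix: min/max statistics only describe the non-null values.
        guarantee = ("or", guarantee, ("is_null", field))
    return guarantee

def simplify(predicate, guarantee):
    """Fold is_null(field) to False when the guarantee pins field to a value."""
    known = {}
    # Only a top-level equality contributes a field-value pair.
    if guarantee is not None and guarantee[0] == "eq":
        known[guarantee[1]] = guarantee[2]
    if predicate[0] == "is_null" and predicate[1] in known:
        return False  # the row group would be pruned
    return predicate  # cannot simplify; the row group is scanned

predicate = ("is_null", "y")
naive = guarantee_from_stats("y", 0, 0, null_count=1, null_aware=False)
aware = guarantee_from_stats("y", 0, 0, null_count=1, null_aware=True)
print(simplify(predicate, naive))
print(simplify(predicate, aware))
```

With the naive guarantee the predicate folds to {{False}}, which mirrors the 
pruned row group. With the null-aware guarantee the predicate is left as is, 
so the row group would still be scanned.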


> [R][C++] filtering a Parquet dataset with is.na() misses some rows
> ------------------------------------------------------------------
>
>                 Key: ARROW-15312
>                 URL: https://issues.apache.org/jira/browse/ARROW-15312
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>         Environment: R 4.1.2 on Windows
> arrow 6.0.1
> dplyr 1.0.7
>            Reporter: Pierre Gramme
>            Priority: Major
>              Labels: dataset
>             Fix For: 7.0.1, 8.0.0
>
>
> Hi!
> I just found an issue when querying an Arrow dataset with dplyr and filtering 
> on is.na(...).
> It seems linked to columns containing only one distinct value and some NAs.
> Can you also reproduce the following?
>  
> {code:java}
>   library(arrow)
>   library(dplyr)
>   
>   ds_path = "test-arrow-na"
>   df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))
>   
>   df %>% arrow::write_dataset(ds_path)
>   
>   # OK: Collect then filter: returns row 3, as expected
>   arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))
>   # WRONG: Filter then collect (on y) returns a tibble with no rows
>   arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()
>   
>   # OK: Filter then collect (on z) returns row 3, as expected
>   arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()
> {code}
>  
> Thanks
> Pierre



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
