[jira] [Commented] (ARROW-15312) [R][C++] filtering a Parquet dataset with is.na() misses some rows

David Li (Jira) Sat, 12 Feb 2022 10:17:05 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-15312?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17491425#comment-17491425
 ]


David Li commented on ARROW-15312:
----------------------------------

I think we basically need to pick up and finish 
https://github.com/apache/arrow/pull/10253. We should also rigorously define 
and document how we treat guarantees: we're playing fast and loose with them 
right now, and that is leading to wrong results like this.
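
To make the concern about guarantees concrete, here is a minimal C++ sketch of 
how a bug of this kind could arise. It is not the actual Arrow code: every name 
in it (ColumnStats, Guarantee, NaiveGuarantee, NullAwareGuarantee, 
SimplifyIsNull) is hypothetical. It assumes that a row group's min/max 
statistics are turned into a "column == value" guarantee whenever min == max, 
without consulting the null count, and that is_null(column) is then simplified 
to false for that row group.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical, simplified stand-in for per-row-group column statistics.
struct ColumnStats {
  std::int64_t min;         // smallest non-null value
  std::int64_t max;         // largest non-null value
  std::int64_t null_count;  // number of nulls in the row group
};

// Hypothetical guarantee: if single_value is set, we claim that every row
// of the column equals that literal (and therefore that no row is null).
struct Guarantee {
  std::optional<std::int64_t> single_value;
};

enum class Verdict { kNeverMatches, kMustScan };

// Unsound (assumed buggy) derivation: min == max is read as "column == min",
// even when the row group also contains nulls.
Guarantee NaiveGuarantee(const ColumnStats& s) {
  if (s.min == s.max) return Guarantee{s.min};
  return Guarantee{std::nullopt};
}

// Sound derivation: only claim "column == min" when there are no nulls.
Guarantee NullAwareGuarantee(const ColumnStats& s) {
  if (s.min == s.max && s.null_count == 0) return Guarantee{s.min};
  return Guarantee{std::nullopt};
}

// Simplify the filter is_null(column) against a guarantee. If the column is
// claimed to equal a literal, is_null(literal) is false and the whole row
// group can be skipped; otherwise the row group has to be scanned.
Verdict SimplifyIsNull(const Guarantee& g) {
  return g.single_value.has_value() ? Verdict::kNeverMatches
                                    : Verdict::kMustScan;
}
```

With statistics like those of y in the report (min = 0, max = 0, one null), the 
naive derivation lets the is_null filter discard the row group, while the 
null-aware one forces a scan. For z (min = 0, max = 1) both versions force a 
scan, which would be consistent with the reported behavior.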

CC [~westonpace], [~bkietz]

> [R][C++] filtering a Parquet dataset with is.na() misses some rows
> ------------------------------------------------------------------
>
>                 Key: ARROW-15312
>                 URL: https://issues.apache.org/jira/browse/ARROW-15312
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 6.0.1
>         Environment: R 4.1.2 on Windows
> arrow 6.0.1
> dplyr 1.0.7
>            Reporter: Pierre Gramme
>            Priority: Major
>             Fix For: 7.0.1, 8.0.0
>
>
> Hi!
> I just found an issue when querying an Arrow dataset with dplyr and filtering 
> on is.na(...).
> It seems linked to columns containing only one distinct value and some NAs.
> Can you also reproduce the following?
>  
> {code:r}
>   library(arrow)
>   library(dplyr)
>   
>   ds_path = "test-arrow-na"
>   df = tibble(x=1:3, y=c(0L, 0L, NA_integer_), z=c(0L, 1L, NA_integer_))
>   
>   df %>% arrow::write_dataset(ds_path)
>   
>   # OK: Collect then filter: returns row 3, as expected
>   arrow::open_dataset(ds_path) %>% collect() %>% filter(is.na(y))
>   # ERROR: Filter then collect (on y) returns a tibble with no rows
>   arrow::open_dataset(ds_path) %>% filter(is.na(y)) %>% collect()
>   
>   # OK: Filter then collect (on z) returns row 3, as expected
>   arrow::open_dataset(ds_path) %>% filter(is.na(z)) %>% collect()
> {code}
>  
> Thanks
> Pierre



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
