[ 
https://issues.apache.org/jira/browse/ARROW-8216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17067902#comment-17067902
 ] 

Neal Richardson commented on ARROW-8216:
----------------------------------------

Thanks for the report. I did some exploration on this.

1. The resulting data.frames differ because the one from the dataset 
scan includes an all-NA row for every row where {{hair_color}} is missing. 

{code}
> identical(df_dataset[!is.na(df_dataset$hair_color),], df_mem)
[1] TRUE
{code}
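
Until this is resolved, one possible stopgap is to drop the all-NA rows after {{collect()}}. The following is only a sketch: {{collected}} is a hand-built stand-in for a collected result (not real dataset output), and the approach would also discard any rows that are legitimately all NA in the source data.

{code}
# Hypothetical stand-in for the result of collect(): one spurious all-NA row
# followed by one real row.
collected <- data.frame(
  a = c(NA, 3L), int = c(NA, 5L), dbl = c(NA, 6),
  str = c(NA_character_, NA_character_),
  stringsAsFactors = FALSE
)
# A row is "all NA" when it has zero non-missing cells.
all_na <- rowSums(!is.na(collected)) == 0
cleaned <- collected[!all_na, , drop = FALSE]
nrow(cleaned)
#> [1] 1
cleaned$a
#> [1] 3
{code}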

2. The fact that {{hair_color}} has empty strings is irrelevant, as is the fact 
that it is a string column. Here's a simpler example:

{code}
> library(arrow)
> library(dplyr)
> dir <- tempdir()
> fpath <- file.path(dir, "data.parquet")
> 
> df <- data.frame(a=1:3, int=c(NA, 4L, 5L), dbl=c(5.0, NA, 6), str=c("a", "b", NA), stringsAsFactors=FALSE)
> df
  a int dbl  str
1 1  NA   5    a
2 2   4  NA    b
3 3   5   6 <NA>
> write_parquet(df, fpath)
> ds <- open_dataset(dir)
> ds %>% filter(int > 4) %>% collect()
# A tibble: 2 x 4
      a   int   dbl str  
  <int> <int> <dbl> <chr>
1    NA    NA    NA NA   
2     3     5     6 NA   
> ds %>% filter(dbl == 5) %>% collect()
# A tibble: 2 x 4
      a   int   dbl str  
  <int> <int> <dbl> <chr>
1     1    NA     5 a    
2    NA    NA    NA NA   
> ds %>% filter(str == "a") %>% collect()
# A tibble: 2 x 4
      a   int   dbl str  
  <int> <int> <dbl> <chr>
1     1    NA     5 a    
2    NA    NA    NA NA   
> ds %>% filter(str == "d") %>% collect()
# A tibble: 1 x 4
      a   int   dbl str  
  <int> <int> <dbl> <chr>
1    NA    NA    NA NA   
{code}

3. As for what _should_ happen: on the one hand, matching what {{dplyr}} 
does (it drops rows where the filter condition evaluates to NA) is good; on 
the other, one could conceptually argue that if I filter where {{int > 4}}, 
I should keep the rows where {{int}} is NA because we don't know whether or 
not they are > 4. (But that's not what is happening here: every value in 
those rows is being replaced with NA.) So maybe this should be some option?
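
For reference, here are the two candidate behaviors on an ordinary in-memory data.frame, using the same toy data as above. This is a sketch of {{dplyr}} semantics only; it says nothing about how a Dataset option would be spelled.

{code}
library(dplyr)

df <- data.frame(a=1:3, int=c(NA, 4L, 5L), dbl=c(5.0, NA, 6),
                 str=c("a", "b", NA), stringsAsFactors=FALSE)

# dplyr's default: rows where the condition is NA are dropped.
dropped <- df %>% filter(int > 4)
dropped$a
#> [1] 3

# The "we don't know, so keep it" reading has to be requested explicitly.
# The kept row retains its real values; it is not blanked out.
kept <- df %>% filter(is.na(int) | int > 4)
kept$a
#> [1] 1 3
kept$dbl
#> [1] 5 6
{code}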

4. Interestingly, this behavior _is_ consistent with how base R handles 
extracting rows with NA in the selection vector:

{code}
> df[df$int > 4,]
    a int dbl  str
NA NA  NA  NA <NA>
3   3   5   6 <NA>
# Because:
> df$int > 4
[1]    NA FALSE  TRUE
{code}
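
In base R, the usual ways to avoid that NA row are to wrap the condition in {{which()}}, which returns only the positions that are TRUE, or to use {{subset()}}, which treats NA in the condition as FALSE. A small sketch on the same toy data:

{code}
df <- data.frame(a=1:3, int=c(NA, 4L, 5L), dbl=c(5.0, NA, 6),
                 str=c("a", "b", NA), stringsAsFactors=FALSE)

# which() ignores NA and FALSE, keeping only the index of TRUE elements.
idx <- which(df$int > 4)
idx
#> [1] 3
by_which <- df[idx, ]
by_which$a
#> [1] 3

# subset() likewise drops rows where the condition is NA.
by_subset <- subset(df, int > 4)
nrow(by_subset)
#> [1] 1
{code}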

> [R] filter method for Dataset doesn't distinguish between empty strings and 
> NAs
> -------------------------------------------------------------------------------
>
>                 Key: ARROW-8216
>                 URL: https://issues.apache.org/jira/browse/ARROW-8216
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: R
>    Affects Versions: 0.16.0
>         Environment: R 3.6.3, Windows 10
>            Reporter: Sam Albers
>            Priority: Minor
>             Fix For: 0.17.0
>
>
>  
> I have just noticed some slightly odd behaviour with the filter method for 
> Dataset. 
>  
> {code:java}
> library(arrow)
> library(dplyr)
> packageVersion("arrow")
> #> [1] '0.16.0.20200323'
> ## Make sample parquet
> starwars$hair_color[starwars$hair_color == "brown"] <- ""
> dir <- tempdir()
> fpath <- file.path(dir, "data.parquet")
> write_parquet(starwars, fpath)
> ## df in memory
> df_mem <- starwars %>%
>  filter(hair_color == "")
> ## reading from the parquet
> df_parquet <- read_parquet(fpath) %>%
>  filter(hair_color == "")
> ## using open_dataset
> df_dataset <- open_dataset(dir) %>%
>  filter(hair_color == "") %>%
>  collect()
> identical(df_mem, df_parquet)
> #> [1] TRUE
> identical(df_mem, df_dataset)
> #> [1] FALSE
> {code}
>  
>  
> I'm pretty sure all these should return the same data.frame. Am I missing 
> something?
>  



