annakrystalli commented on issue #14907:
URL: https://github.com/apache/arrow/issues/14907#issuecomment-1735674700

   Hello!
   
   I was wondering whether this has been resolved, as I'm still seeing 
behaviour that differs from `dplyr` in both `left_join()` and 
`right_join()` (see reprex below) in `arrow` version 12.0.1.1. 
   
   The difference appears to be that `dplyr` matches `NA` values by default, 
while `arrow` does not seem to.
   
   Am I missing something? 
   
   ``` r
   library(dplyr)
   #> 
   #> Attaching package: 'dplyr'
   #> The following objects are masked from 'package:stats':
   #> 
   #>     filter, lag
   #> The following objects are masked from 'package:base':
   #> 
   #>     intersect, setdiff, setequal, union
   library(arrow)
   #> 
   #> Attaching package: 'arrow'
   #> The following object is masked from 'package:utils':
   #> 
   #>     timestamp
   
   tbl1 <- tibble::tibble(
     a = 1:3,
     b = c("a", "b", NA),
     d = c(letters[4:6])
   )
   
   tbl2 <- tibble::tibble(
     b = c("b", NA),
     c = c("a should be 2", "a should be 3")
   )
   
   # Left join tibbles, NAs matched
   left_join(tbl2, tbl1)
   #> Joining with `by = join_by(b)`
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3     3 f
   
   # Left join arrow table & tibble, NAs NOT matched
   left_join(as_arrow_table(tbl2), tbl1) %>% collect()
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3    NA <NA>
   # Left join arrow table & arrow table, NAs NOT matched
   left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
   #> # A tibble: 2 × 4
   #>   b     c                 a d    
   #>   <chr> <chr>         <int> <chr>
   #> 1 b     a should be 2     2 e    
   #> 2 <NA>  a should be 3    NA <NA>
   
   
   # Same with right_join. It appears that the NAs are the problem as arrow 
   # doesn't seem to count them as a match
   right_join(tbl1, tbl2)
   #> Joining with `by = join_by(b)`
   #> # A tibble: 2 × 4
   #>       a b     d     c            
   #>   <int> <chr> <chr> <chr>        
   #> 1     2 b     e     a should be 2
   #> 2     3 <NA>  f     a should be 3
   right_join(as_arrow_table(tbl1), as_arrow_table(tbl2)) %>% collect()
   #> # A tibble: 2 × 4
   #>       a b     d     c            
   #>   <int> <chr> <chr> <chr>        
   #> 1     2 b     e     a should be 2
   #> 2    NA <NA>  <NA>  a should be 3
   right_join(as_arrow_table(tbl1), tbl2) %>% collect()
   #> # A tibble: 2 × 4
   #>       a b     d     c            
   #>   <int> <chr> <chr> <chr>        
   #> 1    NA <NA>  <NA>  a should be 3
   #> 2     2 b     e     a should be 2
   ```
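
   As a point of comparison, here is a minimal sketch (not part of the original reprex) that uses `dplyr`'s own `na_matches` argument on the same two tibbles. It assumes the discrepancy really is down to how `NA` keys are matched: with `na_matches = "never"`, plain `dplyr` produces the same unmatched row that the `arrow` joins return above. The names `matched` and `unmatched` are only illustrative.

   ``` r
   library(dplyr)

   # Same data as in the reprex above
   tbl1 <- tibble::tibble(
     a = 1:3,
     b = c("a", "b", NA),
     d = letters[4:6]
   )
   tbl2 <- tibble::tibble(
     b = c("b", NA),
     c = c("a should be 2", "a should be 3")
   )

   # dplyr default (na_matches = "na"): the NA key in tbl2 matches the NA key in tbl1
   matched <- left_join(tbl2, tbl1, by = "b")

   # na_matches = "never": NA keys are treated as unmatched,
   # which mirrors the arrow output shown in the reprex
   unmatched <- left_join(tbl2, tbl1, by = "b", na_matches = "never")

   matched$a
   unmatched$a
   ```

   This only shows that the `arrow` results are consistent with `NA` keys never matching; it does not say anything about how `arrow` implements the join internally.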
   
   <sup>Created on 2023-09-26 with [reprex v2.0.2](https://reprex.tidyverse.org)</sup>
   
   <details style="margin-bottom:10px;">
   <summary>
   Session info
   </summary>
   
   ``` r
   sessioninfo::session_info()
   #> ─ Session info ───────────────────────────────────────────────────────────────
   #>  setting  value
   #>  version  R version 4.2.1 (2022-06-23)
   #>  os       macOS Ventura 13.5.2
   #>  system   aarch64, darwin20
   #>  ui       X11
   #>  language (EN)
   #>  collate  en_US.UTF-8
   #>  ctype    en_US.UTF-8
   #>  tz       Europe/Athens
   #>  date     2023-09-26
   #>  pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
   #> 
   #> ─ Packages ───────────────────────────────────────────────────────────────────
   #>  package     * version  date (UTC) lib source
   #>  arrow       * 12.0.1.1 2023-07-18 [1] CRAN (R 4.2.0)
   #>  assertthat    0.2.1    2019-03-21 [1] CRAN (R 4.2.0)
   #>  bit           4.0.5    2022-11-15 [1] CRAN (R 4.2.0)
   #>  bit64         4.0.5    2020-08-30 [1] CRAN (R 4.2.0)
   #>  cli           3.6.1    2023-03-23 [1] CRAN (R 4.2.0)
   #>  digest        0.6.33   2023-07-07 [1] CRAN (R 4.2.0)
   #>  dplyr       * 1.1.3    2023-09-03 [1] CRAN (R 4.2.0)
   #>  evaluate      0.20     2023-01-17 [1] CRAN (R 4.2.0)
   #>  fansi         1.0.4    2023-01-22 [1] CRAN (R 4.2.0)
   #>  fastmap       1.1.1    2023-02-24 [1] CRAN (R 4.2.0)
   #>  fs            1.6.3    2023-07-20 [1] CRAN (R 4.2.0)
   #>  generics      0.1.3    2022-07-05 [1] CRAN (R 4.2.1)
   #>  glue          1.6.2    2022-02-24 [1] CRAN (R 4.2.0)
   #>  htmltools     0.5.6    2023-08-10 [1] CRAN (R 4.2.0)
   #>  knitr         1.42     2023-01-25 [1] CRAN (R 4.2.0)
   #>  lifecycle     1.0.3    2022-10-07 [1] CRAN (R 4.2.0)
   #>  magrittr      2.0.3    2022-03-30 [1] CRAN (R 4.2.0)
   #>  pillar        1.9.0    2023-03-22 [1] CRAN (R 4.2.0)
   #>  pkgconfig     2.0.3    2019-09-22 [1] CRAN (R 4.2.0)
   #>  purrr         1.0.2    2023-08-10 [1] CRAN (R 4.2.0)
   #>  R.cache       0.16.0   2022-07-21 [1] CRAN (R 4.2.0)
   #>  R.methodsS3   1.8.2    2022-06-13 [1] CRAN (R 4.2.0)
   #>  R.oo          1.25.0   2022-06-12 [1] CRAN (R 4.2.0)
   #>  R.utils       2.12.2   2022-11-11 [1] CRAN (R 4.2.0)
   #>  R6            2.5.1    2021-08-19 [1] CRAN (R 4.2.0)
   #>  reprex        2.0.2    2022-08-17 [3] CRAN (R 4.2.0)
   #>  rlang         1.1.1    2023-04-28 [1] CRAN (R 4.2.0)
   #>  rmarkdown     2.21     2023-03-26 [1] CRAN (R 4.2.0)
   #>  rstudioapi    0.14     2022-08-22 [1] CRAN (R 4.2.1)
   #>  sessioninfo   1.2.2    2021-12-06 [3] CRAN (R 4.2.0)
   #>  styler        1.7.0    2022-03-13 [1] CRAN (R 4.2.0)
   #>  tibble        3.2.1    2023-03-20 [1] CRAN (R 4.2.0)
   #>  tidyselect    1.2.0    2022-10-10 [1] CRAN (R 4.2.0)
   #>  utf8          1.2.3    2023-01-31 [1] CRAN (R 4.2.0)
   #>  vctrs         0.6.3    2023-06-14 [1] CRAN (R 4.2.0)
   #>  withr         2.5.0    2022-03-03 [1] CRAN (R 4.2.0)
   #>  xfun          0.39     2023-04-20 [1] CRAN (R 4.2.0)
   #>  yaml          2.3.7    2023-01-23 [1] CRAN (R 4.2.0)
   #> 
   #>  [1] /Users/Anna/Library/R/arm64/4.2/library
   #>  [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/site-library
   #>  [3] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
   #> 
   #> ──────────────────────────────────────────────────────────────────────────────
   ```
   
   </details>
   
   

