annakrystalli commented on issue #14907:
URL: https://github.com/apache/arrow/issues/14907#issuecomment-1735674700
Hello!
I was wondering whether this has been resolved, as I'm still seeing
behaviour that differs from `dplyr` in both `left_join()` and
`right_join()` (see reprex below) with `arrow` version 12.0.1.1.
The difference appears to be that `dplyr` matches `NA` values in join keys
by default, while `arrow` does not seem to.
Am I missing something?
``` r
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#> filter, lag
#> The following objects are masked from 'package:base':
#>
#> intersect, setdiff, setequal, union
library(arrow)
#>
#> Attaching package: 'arrow'
#> The following object is masked from 'package:utils':
#>
#> timestamp
tbl1 <- tibble::tibble(
a = 1:3,
b = c("a", "b", NA),
d = c(letters[4:6])
)
tbl2 <- tibble::tibble(
b = c("b", NA),
c = c("a should be 2", "a should be 3")
)
# Left join tibbles, NAs matched
left_join(tbl2, tbl1)
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 3 f
# Left join arrow table & tibble, NAs NOT matched
left_join(as_arrow_table(tbl2), tbl1) %>% collect()
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 NA <NA>
# Left join arrow table & arrow table, NAs NOT matched
left_join(as_arrow_table(tbl2), as_arrow_table(tbl1)) %>% collect()
#> # A tibble: 2 × 4
#> b c a d
#> <chr> <chr> <int> <chr>
#> 1 b a should be 2 2 e
#> 2 <NA> a should be 3 NA <NA>
# Same with right_join. It appears that the NAs are the problem as arrow
# doesn't seem to count them as a match
right_join(tbl1, tbl2)
#> Joining with `by = join_by(b)`
#> # A tibble: 2 × 4
#> a b d c
#> <int> <chr> <chr> <chr>
#> 1 2 b e a should be 2
#> 2 3 <NA> f a should be 3
right_join(as_arrow_table(tbl1), as_arrow_table(tbl2)) %>% collect()
#> # A tibble: 2 × 4
#> a b d c
#> <int> <chr> <chr> <chr>
#> 1 2 b e a should be 2
#> 2 NA <NA> <NA> a should be 3
right_join(as_arrow_table(tbl1), tbl2) %>% collect()
#> # A tibble: 2 × 4
#> a b d c
#> <int> <chr> <chr> <chr>
#> 1 NA <NA> <NA> a should be 3
#> 2 2 b e a should be 2
```
<sup>Created on 2023-09-26 with [reprex
v2.0.2](https://reprex.tidyverse.org)</sup>
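For comparison, here is a minimal sketch using `dplyr` only. It relies on the `na_matches` argument of `dplyr`'s join functions, whose default `"na"` treats `NA` keys as equal to each other, while `"never"` leaves them unmatched. With `na_matches = "never"`, `dplyr` appears to return the same result that `arrow` gives in the reprex above. The names `matched` and `unmatched` are introduced here purely for illustration, and nothing in this sketch says how `arrow` implements its joins.

``` r
library(dplyr)

# Same data as in the reprex above
tbl1 <- tibble::tibble(
  a = 1:3,
  b = c("a", "b", NA),
  d = letters[4:6]
)
tbl2 <- tibble::tibble(
  b = c("b", NA),
  c = c("a should be 2", "a should be 3")
)

# dplyr default (na_matches = "na"): the NA key in tbl2 matches the NA key in tbl1
matched <- left_join(tbl2, tbl1, by = "b")

# na_matches = "never": NA keys are never treated as a match,
# so columns a and d are NA for the row whose key is NA
unmatched <- left_join(tbl2, tbl1, by = "b", na_matches = "never")
```

If `arrow` is indeed not matching `NA` keys, `unmatched` is the shape of result to expect from it, and `matched` is the `dplyr` default that the issue asks for.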
<details style="margin-bottom:10px;">
<summary>
Session info
</summary>
``` r
sessioninfo::session_info()
#> ─ Session info ───────────────────────────────────────────────────────────────
#> setting value
#> version R version 4.2.1 (2022-06-23)
#> os macOS Ventura 13.5.2
#> system aarch64, darwin20
#> ui X11
#> language (EN)
#> collate en_US.UTF-8
#> ctype en_US.UTF-8
#> tz Europe/Athens
#> date 2023-09-26
#> pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
#>
#> ─ Packages ───────────────────────────────────────────────────────────────────
#> package * version date (UTC) lib source
#> arrow * 12.0.1.1 2023-07-18 [1] CRAN (R 4.2.0)
#> assertthat 0.2.1 2019-03-21 [1] CRAN (R 4.2.0)
#> bit 4.0.5 2022-11-15 [1] CRAN (R 4.2.0)
#> bit64 4.0.5 2020-08-30 [1] CRAN (R 4.2.0)
#> cli 3.6.1 2023-03-23 [1] CRAN (R 4.2.0)
#> digest 0.6.33 2023-07-07 [1] CRAN (R 4.2.0)
#> dplyr * 1.1.3 2023-09-03 [1] CRAN (R 4.2.0)
#> evaluate 0.20 2023-01-17 [1] CRAN (R 4.2.0)
#> fansi 1.0.4 2023-01-22 [1] CRAN (R 4.2.0)
#> fastmap 1.1.1 2023-02-24 [1] CRAN (R 4.2.0)
#> fs 1.6.3 2023-07-20 [1] CRAN (R 4.2.0)
#> generics 0.1.3 2022-07-05 [1] CRAN (R 4.2.1)
#> glue 1.6.2 2022-02-24 [1] CRAN (R 4.2.0)
#> htmltools 0.5.6 2023-08-10 [1] CRAN (R 4.2.0)
#> knitr 1.42 2023-01-25 [1] CRAN (R 4.2.0)
#> lifecycle 1.0.3 2022-10-07 [1] CRAN (R 4.2.0)
#> magrittr 2.0.3 2022-03-30 [1] CRAN (R 4.2.0)
#> pillar 1.9.0 2023-03-22 [1] CRAN (R 4.2.0)
#> pkgconfig 2.0.3 2019-09-22 [1] CRAN (R 4.2.0)
#> purrr 1.0.2 2023-08-10 [1] CRAN (R 4.2.0)
#> R.cache 0.16.0 2022-07-21 [1] CRAN (R 4.2.0)
#> R.methodsS3 1.8.2 2022-06-13 [1] CRAN (R 4.2.0)
#> R.oo 1.25.0 2022-06-12 [1] CRAN (R 4.2.0)
#> R.utils 2.12.2 2022-11-11 [1] CRAN (R 4.2.0)
#> R6 2.5.1 2021-08-19 [1] CRAN (R 4.2.0)
#> reprex 2.0.2 2022-08-17 [3] CRAN (R 4.2.0)
#> rlang 1.1.1 2023-04-28 [1] CRAN (R 4.2.0)
#> rmarkdown 2.21 2023-03-26 [1] CRAN (R 4.2.0)
#> rstudioapi 0.14 2022-08-22 [1] CRAN (R 4.2.1)
#> sessioninfo 1.2.2 2021-12-06 [3] CRAN (R 4.2.0)
#> styler 1.7.0 2022-03-13 [1] CRAN (R 4.2.0)
#> tibble 3.2.1 2023-03-20 [1] CRAN (R 4.2.0)
#> tidyselect 1.2.0 2022-10-10 [1] CRAN (R 4.2.0)
#> utf8 1.2.3 2023-01-31 [1] CRAN (R 4.2.0)
#> vctrs 0.6.3 2023-06-14 [1] CRAN (R 4.2.0)
#> withr 2.5.0 2022-03-03 [1] CRAN (R 4.2.0)
#> xfun 0.39 2023-04-20 [1] CRAN (R 4.2.0)
#> yaml 2.3.7 2023-01-23 [1] CRAN (R 4.2.0)
#>
#> [1] /Users/Anna/Library/R/arm64/4.2/library
#> [2] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/site-library
#> [3] /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library
#>
#> ──────────────────────────────────────────────────────────────────────────────
```
</details>