[
https://issues.apache.org/jira/browse/ARROW-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Neal Richardson resolved ARROW-15838.
-------------------------------------
Resolution: Fixed
Issue resolved by pull request 14286
[https://github.com/apache/arrow/pull/14286]
> [C++] Key column behavior in joins
> ----------------------------------
>
> Key: ARROW-15838
> URL: https://issues.apache.org/jira/browse/ARROW-15838
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jonathan Keane
> Assignee: Will Jones
> Priority: Major
> Labels: pull-request-available
> Fix For: 10.0.0
>
> Time Spent: 2h
> Remaining Estimate: 0h
>
> By default in dplyr (and possibly in pandas too?) coalesces the key column
> for full joins to be the (non-null) values from both key columns:
> {code}
> > left <- tibble::tibble(
> key = c(1, 2),
> A = c(0, 1),
> )
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
> key = c(2, 3),
> B = c(0, 1),
> )
> right_tab <- Table$create(right)
> > left %>% full_join(right)
> Joining, by = "key"
> # A tibble: 3 × 3
> key A B
> <dbl> <dbl> <dbl>
> 1 1 0 NA
> 2 2 1 0
> 3 3 NA 1
> > left_tab %>% full_join(right_tab) %>% collect()
> # A tibble: 3 × 3
> key A B
> <dbl> <dbl> <dbl>
> 1 2 1 0
> 2 1 0 NA
> 3 NA NA 1
> {code}
> And right join, we would expect the key from the right table to be in the
> result, but we get the key from the left instead:
> {code}
> > left <- tibble::tibble(
> key = c(1, 2),
> A = c(0, 1),
> )
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
> key = c(2, 3),
> B = c(0, 1),
> )
> right_tab <- Table$create(right)
> > left %>% right_join(right)
> Joining, by = "key"
> # A tibble: 2 × 3
> key A B
> <dbl> <dbl> <dbl>
> 1 2 1 0
> 2 3 NA 1
> > left_tab %>% right_join(right_tab) %>% collect()
> # A tibble: 2 × 3
> key A B
> <dbl> <dbl> <dbl>
> 1 2 1 0
> 2 NA NA 1
> {code}
> Additionally, we should be able to keep both key columns with an option (cf
> https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)