[
https://issues.apache.org/jira/browse/ARROW-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567939#comment-17567939
]
Joris Van den Bossche commented on ARROW-15838:
-----------------------------------------------
This is indeed what pyarrow does
(https://github.com/apache/arrow/blob/4db32223a575b92f1e6baef0faa5fce39abc35fb/python/pyarrow/_exec_plan.pyx#L304-L343).
However, it leads to quite some complex code (to infer which fields to
project, names can change etc so we need to get the field indices in the output
that are the original key columns), and if both R and Python need it, that
might be a good reason to integrate this more in the join functionality of
Acero itself.
> [C++] Key column behavior in joins
> ----------------------------------
>
> Key: ARROW-15838
> URL: https://issues.apache.org/jira/browse/ARROW-15838
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++
> Reporter: Jonathan Keane
> Priority: Major
> Fix For: 10.0.0
>
>
> By default in dplyr (and possibly in pandas too?) coalesces the key column
> for full joins to be the (non-null) values from both key columns:
> {code}
> > left <- tibble::tibble(
> key = c(1, 2),
> A = c(0, 1),
> )
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
> key = c(2, 3),
> B = c(0, 1),
> )
> right_tab <- Table$create(right)
> > left %>% full_join(right)
> Joining, by = "key"
> # A tibble: 3 × 3
> key A B
> <dbl> <dbl> <dbl>
> 1 1 0 NA
> 2 2 1 0
> 3 3 NA 1
> > left_tab %>% full_join(right_tab) %>% collect()
> # A tibble: 3 × 3
> key A B
> <dbl> <dbl> <dbl>
> 1 2 1 0
> 2 1 0 NA
> 3 NA NA 1
> {code}
> And right join, we would expect the key from the right table to be in the
> result, but we get the key from the left instead:
> {code}
> > left <- tibble::tibble(
> key = c(1, 2),
> A = c(0, 1),
> )
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
> key = c(2, 3),
> B = c(0, 1),
> )
> right_tab <- Table$create(right)
> > left %>% right_join(right)
> Joining, by = "key"
> # A tibble: 2 × 3
> key A B
> <dbl> <dbl> <dbl>
> 1 2 1 0
> 2 3 NA 1
> > left_tab %>% right_join(right_tab) %>% collect()
> # A tibble: 2 × 3
> key A B
> <dbl> <dbl> <dbl>
> 1 2 1 0
> 2 NA NA 1
> {code}
> Additionally, we should be able to keep both key columns with an option (cf
> https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)