[jira] [Commented] (ARROW-15838) [C++] Key column behavior in joins

Joris Van den Bossche (Jira) Mon, 18 Jul 2022 04:22:06 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-15838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17567939#comment-17567939
 ]


Joris Van den Bossche commented on ARROW-15838:
-----------------------------------------------

This is indeed what pyarrow does 
(https://github.com/apache/arrow/blob/4db32223a575b92f1e6baef0faa5fce39abc35fb/python/pyarrow/_exec_plan.pyx#L304-L343).
 However, it leads to quite some complex code (to infer which fields to 
project, names can change etc so we need to get the field indices in the output 
that are the original key columns), and if both R and Python need it, that 
might be a good reason to integrate this more in the join functionality of 
Acero itself.

> [C++] Key column behavior in joins
> ----------------------------------
>
>                 Key: ARROW-15838
>                 URL: https://issues.apache.org/jira/browse/ARROW-15838
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Jonathan Keane
>            Priority: Major
>             Fix For: 10.0.0
>
>
> By default in dplyr (and possibly in pandas too?) coalesces the key column 
> for full joins to be the (non-null) values from both key columns:
> {code}
> > left <- tibble::tibble(
>   key = c(1, 2),
>   A = c(0, 1),                  
> )                      
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
>   key = c(2, 3),
>   B = c(0, 1),
> )                      
> right_tab <- Table$create(right)
> > left %>% full_join(right) 
> Joining, by = "key"
> # A tibble: 3 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     1     0    NA
> 2     2     1     0
> 3     3    NA     1
> > left_tab %>% full_join(right_tab) %>% collect()
> # A tibble: 3 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2     1     0    NA
> 3    NA    NA     1
> {code}
> And right join, we would expect the key from the right table to be in the 
> result, but we get the key from the left instead:
> {code}
> > left <- tibble::tibble(
>   key = c(1, 2),
>   A = c(0, 1),                  
> )                      
> left_tab <- Table$create(left)
> > right <- tibble::tibble(
>   key = c(2, 3),
>   B = c(0, 1),
> )                      
> right_tab <- Table$create(right)
> > left %>% right_join(right)
> Joining, by = "key"
> # A tibble: 2 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2     3    NA     1
> > left_tab %>% right_join(right_tab) %>% collect()
> # A tibble: 2 × 3
>     key     A     B
>   <dbl> <dbl> <dbl>
> 1     2     1     0
> 2    NA    NA     1
> {code}
> Additionally, we should be able to keep both key columns with an option (cf 
> https://github.com/apache/arrow/blob/9719eae66dcf38c966ae769215d27020a6dd5550/r/R/dplyr-join.R#L32)
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

[jira] [Commented] (ARROW-15838) [C++] Key column behavior in joins

Reply via email to