[jira] [Comment Edited] (ARROW-14679) [R] [C++] Handle suffix argument in joins

Vibhatha Lakmal Abeykoon (Jira) Thu, 30 Dec 2021 20:54:05 -0800


    [ https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17467099#comment-17467099 ]


Vibhatha Lakmal Abeykoon edited comment on ARROW-14679 at 12/31/21, 4:53 AM:
-----------------------------------------------------------------------------

"I'm happy to split this into separate issues if that's easier (though I'm not 
totally sure that it's necessary), but there are three issues here that we 
should resolve:"

> Yes, the subtask was only added to keep things tidy; splitting is entirely 
> optional :)

 * "Be able to successfully join two tables with columns that have the same 
names (but aren't used as keys)."
 ** > I think the behaviour of pandas.merge is what we should expect here 
(please comment [~westonpace], [~jonkeane]).
 * "Be able to only add the unique-name-making affixes to the columns that are 
duplicated (if I have two tables with the cols: [id, col_a, col_b] and [id, 
col_b, col_c, col_d] and I join them with the key being id, I should get [id, 
col_a, col_b.x, col_b.y, col_c, col_d] (or the prefix version with x.col_b, 
y.col_b if we allow prefixes))."
 ** > First we need to decide whether to keep prefixes or suffixes. Supporting 
both doesn't make sense in my view, and it would make the code that handles 
this case confusing, so we need to decide which option to expose. If we go 
with suffixes, as pandas does, we need to replace the prefixes in the C++ core 
with suffixes. The change itself is trivial, but if prefixes are already 
exposed to multiple language bindings we will have to change those function 
signatures and test cases. The alternative is to leave the C++ core alone and 
rename the schema fields right after getting the join response, using string 
comparison at the R level, but that would be untidy and hard to maintain.
 * "Be able to use suffixes"
 ** > +1 for this. I suggest dropping prefixes if possible, to be consistent 
with the R and pandas stacks, since most users are familiar with suffixes.

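To make the "rename after the join" alternative from the second point concrete, here is a minimal sketch in plain Python. It is only an illustration: the function name `prefixes_to_suffixes`, the `x.`/`y.` prefixes, and the assumption that the key column comes back without a prefix are all hypothetical and are not taken from the Arrow C++ or R code.

```python
from collections import Counter


def prefixes_to_suffixes(joined_names, left_prefix="x.", right_prefix="y.",
                         suffixes=(".x", ".y")):
    """Turn always-prefixed join output names into collision-only suffixed names.

    Hypothetical helper: assumes every non-key column comes back from the
    join carrying a side prefix, which is an assumption of this sketch.
    """
    # Recover (side, base name) for each column by string comparison.
    parsed = []
    for name in joined_names:
        if name.startswith(left_prefix):
            parsed.append((0, name[len(left_prefix):]))
        elif name.startswith(right_prefix):
            parsed.append((1, name[len(right_prefix):]))
        else:
            parsed.append((None, name))  # assumed: key column without a prefix

    # A base name collides when more than one output column reduces to it.
    counts = Counter(base for _, base in parsed)

    result = []
    for side, base in parsed:
        if side is not None and counts[base] > 1:
            result.append(base + suffixes[side])
        else:
            result.append(base)
    return result


renamed = prefixes_to_suffixes(
    ["id", "x.col_a", "x.col_b", "y.col_b", "y.col_c", "y.col_d"]
)
print(renamed)
```

The weak spot is the string comparison: a source column that is genuinely named something like `x.total` would be misread as a prefixed column, which is one reason this route looks hard to maintain.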
 

Pandas Example

-----------------   
{code:python}
    import pandas as pd

    df_l = pd.DataFrame({"id": [10, 20, 30, 40, 50, 10, 30],
                        "a": [11, 12, 14, 14, 15, 16, 17],
                        "b": [1, 2, 3, 4, 5, 6, 7]
                      })
    
    df_r = pd.DataFrame({"id": [10, 10, 12, 41, 51, 20, 30],
                    "a": [21, 22, 24, 24, 25, 26, 27],
                    "c": [1, 2, 3, 4, 5, 6, 7]
                  })
                      
    print(df_l)
    print("-------------")
    print(df_r)
    
    df_join = df_l.merge(df_r, on="id", how="inner", suffixes=(".x", ".y"))
    
    print(df_join)
    
    """
    Output:
       id  a.x  b  a.y  c
    0  10   11  1   21  1
    1  10   11  1   22  2
    2  10   16  6   21  1
    3  10   16  6   22  2
    4  20   12  2   26  6
    5  30   14  3   27  7
    6  30   17  7   27  7
    """
{code}
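The naming rule visible in the output header above (suffix only the non-key columns that exist on both sides, and emit the key once) can be written down independently of any engine. The sketch below is plain Python with a hypothetical helper name, `join_output_names`; it assumes the key columns have the same name on both sides and is not how Arrow or pandas implement it internally.

```python
def join_output_names(left_cols, right_cols, keys, suffixes=(".x", ".y")):
    """Return join output column names, suffixing only colliding non-key columns."""
    left_suffix, right_suffix = suffixes
    key_set = set(keys)
    # Columns present on both sides that are not join keys need disambiguation.
    collisions = (set(left_cols) & set(right_cols)) - key_set

    # Left columns keep their order; only colliding ones get the left suffix.
    names = [c + left_suffix if c in collisions else c for c in left_cols]
    # Right key columns are dropped so that each key appears once.
    names += [c + right_suffix if c in collisions else c
              for c in right_cols if c not in key_set]
    return names


# Column layout of the pandas example above.
print(join_output_names(["id", "a", "b"], ["id", "a", "c"], keys=["id"]))
# Column layout from the issue discussion.
print(join_output_names(["id", "col_a", "col_b"],
                        ["id", "col_b", "col_c", "col_d"], keys=["id"]))
```

For the two layouts this yields `id, a.x, b, a.y, c` and `id, col_a, col_b.x, col_b.y, col_c, col_d`, matching the pandas header above and the result requested in the discussion.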
 



> [R] [C++] Handle suffix argument in joins
> -----------------------------------------
>
>                 Key: ARROW-14679
>                 URL: https://issues.apache.org/jira/browse/ARROW-14679
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: pull-request-available, query-engine
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If there is a name collision, we need to do something 
> https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't seem to actually be able to apply the prefixes (I'm getting 
> errors when trying), I couldn't tell if there were tests of this — I couldn't 
> find any, so I'm not sure if I'm calling this wrong or if it's not working at 
> all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is 
> a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to 
> provide new names?) in the tests I wrote I've worked around this, but it 
> would be nice to be able to match dplyr/allow things other than prefix



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
