[jira] [Comment Edited] (ARROW-14679) [R] [C++] Handle suffix argument in joins

Jonathan Keane (Jira) Thu, 30 Dec 2021 10:23:06 -0800


    [ 
https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466957#comment-17466957
 ]


Jonathan Keane edited comment on ARROW-14679 at 12/30/21, 6:22 PM:
-------------------------------------------------------------------

bq. Is the intent here to match the Arrow R join APIs with the dplyr R API 
for joins (left_join, inner_join, ...)? 

I would like to be able to match the dplyr API if possible. We can deviate from 
it if we have to for some reason (and we can also do some pre/post processing 
in R if we need to: e.g. rename columns in a project after the join)

bq. A note on this, when looking into Pandas (join, merge) and dplyr (join), 
there is no concept called prefix; they only provide suffix

Yeah, suffix seems more right to me (if we only support one), though I will 
admit I haven't surveyed other systems to see what they use (in many SQL 
dialects, you'd need to reference the table as a prefix, e.g. 
{{left_table.col AS col_new}} or the like). I tried to find what ibis does, but 
https://ibis-project.org/docs/generated/ibis.expr.api.TableExpr.inner_join.html 
doesn't have much info about that (that might be the wrong place to look for 
that too!). Would it make sense to make the choice of suffix/prefix an option? 
(That is not intended as a leading question; I'm not sure if the trade-off in 
complexity is worth it here!)

bq. A related but not a direct issue is that, when we read the output from a 
join (given that we are using the execplan and a sink node to get the response 
out), we still have to provide a schema, so no matter which affix we use, the 
output table will have the names we provide in the schema. It is a bit 
bothersome given that the schema should be more or less inferred from the given 
input data, while giving the user an opportunity to project what is needed. 

Hmmm maybe this is what I was running into when I couldn't get the prefixes to 
work at all (the tests on my branch). I haven't been able to trigger this 
feature successfully at all myself.

I'm happy to split this into separate issues if that's easier (though I'm not 
totally sure that it's necessary), but there are three issues here that we 
should resolve:

* Be able to successfully join two tables with columns that have the same names 
(but aren't used as keys). 
* Be able to add the unique-name-making affixes only to the columns that are 
duplicated. If I have two tables with the cols [id, col_a, col_b] and [id, 
col_b, col_c, col_d] and I join them with the key being id, I should get [id, 
col_a, col_b.x, col_b.y, col_c, col_d] (or the prefix version with x.col_b, 
y.col_b if we allow prefixes).
* Be able to use suffixes
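
To make the second and third points concrete, here is a rough sketch in plain 
Python of the naming rule I have in mind. This is not Arrow or dplyr code: the 
function name join_output_names and its parameters are made up for 
illustration, and the default suffixes (".x", ".y") are just the ones from the 
example above.

```python
def join_output_names(left, right, keys, suffixes=(".x", ".y")):
    """Compute output column names for a join (hypothetical helper).

    Only non-key columns that appear in both inputs get a suffix;
    every other column keeps its original name.
    """
    key_set = set(keys)
    # Non-key columns present on both sides are the only collisions.
    collisions = (set(left) - key_set) & (set(right) - key_set)

    # Left columns keep their order; key columns appear once, from the left.
    out = [c + suffixes[0] if c in collisions else c for c in left]
    # Right columns follow, minus the keys already emitted.
    out += [
        c + suffixes[1] if c in collisions else c
        for c in right
        if c not in key_set
    ]
    return out


names = join_output_names(
    ["id", "col_a", "col_b"],
    ["id", "col_b", "col_c", "col_d"],
    keys=["id"],
)
print(names)
```

A prefix variant would only change how the affix is attached (suffix + name 
becomes prefix + name), which is part of why I'm unsure whether making it an 
option is worth the extra complexity.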







> [R] [C++] Handle suffix argument in joins
> -----------------------------------------
>
>                 Key: ARROW-14679
>                 URL: https://issues.apache.org/jira/browse/ARROW-14679
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: pull-request-available, query-engine
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If there is a name collision, we need to do something 
> https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't seem to actually be able to apply the prefixes (I'm getting 
> errors when trying). I couldn't tell if there were tests of this (I couldn't 
> find any), so I'm not sure if I'm calling this wrong or if it's not working at 
> all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is 
> a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to 
> provide new names?). In the tests I wrote I've worked around this, but it 
> would be nice to be able to match dplyr/allow things other than prefix



