[ 
https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466688#comment-17466688
 ] 

Vibhatha Lakmal Abeykoon commented on ARROW-14679:
--------------------------------------------------

[~jonkeane] [~westonpace] 
Is the intent here to make the Arrow R join APIs match the dplyr R API for 
joins (left_join, inner_join, ...)? 

A note on this: when looking into pandas (join, merge) and dplyr (join), there 
is no concept of a prefix; they only provide a suffix. I guess this is the 
standard convention. But our source uses prefixes. Should this be the first 
thing to fix: use a suffix within the C++ code instead of a prefix, and make it 
available to the other language bindings? 
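
To make the suffix behaviour concrete, here is a minimal plain-Python sketch of 
the naming rule as I understand it from dplyr: suffixes are applied only to 
non-key column names that appear on both sides. The function name 
resolve_join_names and its parameters are hypothetical, for illustration only; 
this is not Arrow code. The default suffixes mirror dplyr's c(".x", ".y"), and 
emitting the key columns once from the left side is an assumption of the sketch.

```python
def resolve_join_names(left_cols, right_cols, keys, suffixes=(".x", ".y")):
    """Return output column names for a join, suffixing only colliding names.

    Hypothetical helper for illustration; not part of Arrow, dplyr, or pandas.
    """
    left_suffix, right_suffix = suffixes
    # Non-key names that appear on both sides are the collisions.
    collisions = (set(left_cols) & set(right_cols)) - set(keys)
    # Left columns keep their order; only colliding names get the left suffix.
    left_out = [c + left_suffix if c in collisions else c for c in left_cols]
    # Assumption: key columns are emitted once, from the left side.
    right_out = [
        c + right_suffix if c in collisions else c
        for c in right_cols
        if c not in keys
    ]
    return left_out + right_out


print(resolve_join_names(["id", "value", "a"], ["id", "value", "b"], keys=["id"]))
# prints ['id', 'value.x', 'a', 'value.y', 'b']
```

With prefixes, or with affixes that are always appended, "a" and "b" would be 
renamed as well, which is the mismatch with dplyr described in the issue.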

I want to make sure I have understood this correctly. For now I have gone 
through the code base to see how it is being done. A related but not direct 
issue is that when we read the output from a join (given that we are using the 
execplan and a sink node to get the result out), we still have to provide a 
schema, so no matter which affix we use, the output table will have the names 
we provide in the schema. This is a bit bothersome, given that the schema 
should ideally be inferred from the input data while still giving the user an 
opportunity to project what is needed. It is not clear to me how this is meant 
to work. Maybe I am not 100% familiar with this approach, but I just wanted to 
make a note about this. 

> [R] [C++] Handle suffix argument in joins
> -----------------------------------------
>
>                 Key: ARROW-14679
>                 URL: https://issues.apache.org/jira/browse/ARROW-14679
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: pull-request-available, query-engine
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If there is a name collision, we need to do something 
> https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't seem to actually be able to apply the prefixes (I'm getting 
> errors when trying), I couldn't tell if there were tests of this — I couldn't 
> find any, so I'm not sure if I'm calling this wrong or if it's not working at 
> all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is 
> a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to 
> provide new names?) in the tests I wrote I've worked around this, but it 
> would be nice to be able to match dplyr/allow things other than prefix



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
