[ 
https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466688#comment-17466688
 ] 

Vibhatha Lakmal Abeykoon edited comment on ARROW-14679 at 12/30/21, 5:20 AM:
-----------------------------------------------------------------------------

[~jonkeane] [~westonpace] 
Is the intent here to make the Arrow R join APIs match the dplyr R API for 
joins (left_join, inner_join, ...)? 

A note on this: looking at pandas (join, merge) and dplyr (join), neither has 
a concept of a prefix; they only provide a suffix. I guess this is the 
standard being followed. But in our source we have prefixes. Should the first 
thing to fix be to use suffixes instead of prefixes within the C++ code, and 
make that available to the other language bindings? 
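
To make the behaviour I am referring to concrete, here is a minimal sketch in 
plain Python of collision-only suffixing as I understand it from dplyr and 
pandas. The function name resolve_join_names and its parameters are 
hypothetical and are not part of Arrow, dplyr, or pandas; the default 
suffixes follow dplyr's (".x", ".y"), and the rule that key columns are 
emitted once is an assumption that holds when both sides use the same key 
names. 

```python
def resolve_join_names(left, right, keys, suffixes=(".x", ".y")):
    """Return output column names for a join, suffixing only colliding names.

    Hypothetical helper for illustration; not an Arrow, dplyr, or pandas API.
    """
    left_suffix, right_suffix = suffixes
    key_set = set(keys)
    # Only non-key names present on both sides count as collisions.
    collisions = (set(left) & set(right)) - key_set

    out = []
    for name in left:
        out.append(name + left_suffix if name in collisions else name)
    for name in right:
        if name in key_set:
            continue  # assumption: key columns are emitted once, from the left
        out.append(name + right_suffix if name in collisions else name)
    return out


names = resolve_join_names(
    ["id", "value", "left_only"],
    ["id", "value", "right_only"],
    keys=["id"],
)
print(names)  # ['id', 'value.x', 'left_only', 'value.y', 'right_only']
```

Columns that exist on only one side keep their original names, which is the 
difference from always applying an affix. 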

I want to make sure I have understood this correctly. So far I have gone 
through the code base to see how it is being done. A related, though not 
direct, issue is that when we read the output from a join (given that we are 
using the ExecPlan and a sink node to get the response out), we still have to 
provide a schema, so no matter which affix we use, the output table will have 
the names we provide in the schema. This is a bit bothersome, since the schema 
should be more or less inferred from the input data, while still giving the 
user an opportunity to project what is needed. It is not clear to me how this 
is meant to work. Or is there a way to do the joins without using the 
ExecPlan, by just calling the kernel? (Just curious whether we expose this 
kind of functionality, given that an advanced user may only need the join 
kernel and use their own dataflow model to move data among operators.) Maybe I 
am not 100% familiar with the current approach, but I just wanted to make a 
note about this. 



> [R] [C++] Handle suffix argument in joins
> -----------------------------------------
>
>                 Key: ARROW-14679
>                 URL: https://issues.apache.org/jira/browse/ARROW-14679
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++, R
>            Reporter: Jonathan Keane
>            Assignee: Vibhatha Lakmal Abeykoon
>            Priority: Major
>              Labels: pull-request-available, query-engine
>             Fix For: 7.0.0
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> If there is a name collision, we need to do something 
> https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't seem to actually be able to apply the prefixes (I'm getting 
> errors when trying), I couldn't tell if there were tests of this — I couldn't 
> find any, so I'm not sure if I'm calling this wrong or if it's not working at 
> all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is 
> a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to 
> provide new names?) in the tests I wrote I've worked around this, but it 
> would be nice to be able to match dplyr/allow things other than prefix



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
