[
https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466688#comment-17466688
]
Vibhatha Lakmal Abeykoon edited comment on ARROW-14679 at 12/30/21, 5:20 AM:
-----------------------------------------------------------------------------
[~jonkeane] [~westonpace]
Is the intent here to make the Arrow R join APIs match the dplyr R API for
joins (left_join, inner_join, ...)?
A note on this: looking at Pandas (join, merge) and dplyr (join), there is no
concept of a prefix; they only provide a suffix. I guess this is the standard
convention. But in our source we have prefixes. Should the first thing to fix
be to use suffixes instead of prefixes within the C++ code and make that
available to the other language bindings?
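To make the difference concrete, here is a minimal sketch in plain Python. It
is not Arrow code: the function names, the "l_"/"r_" prefixes, and the "_x"/"_y"
suffixes are assumptions chosen for illustration. It contrasts the behaviour
described in the issue (an affix on every column) with the Pandas/dplyr
convention (a suffix only on colliding non-key columns).

```python
from typing import List, Sequence, Tuple


def suffix_on_collision(
    left: Sequence[str],
    right: Sequence[str],
    keys: Sequence[str],
    suffixes: Tuple[str, str] = ("_x", "_y"),
) -> List[str]:
    """Rename only the non-key columns that appear on both sides."""
    key_set = set(keys)
    collisions = (set(left) & set(right)) - key_set
    out = [name + suffixes[0] if name in collisions else name for name in left]
    # Key columns are emitted once (from the left side), so skip them here.
    out += [
        name + suffixes[1] if name in collisions else name
        for name in right
        if name not in key_set
    ]
    return out


def prefix_always(
    left: Sequence[str],
    right: Sequence[str],
    prefixes: Tuple[str, str] = ("l_", "r_"),
) -> List[str]:
    """Hypothetical model of the reported behaviour: every column gets an affix."""
    return [prefixes[0] + n for n in left] + [prefixes[1] + n for n in right]


left_cols = ["id", "value", "left_only"]
right_cols = ["id", "value", "right_only"]

collision_only = suffix_on_collision(left_cols, right_cols, keys=["id"])
always = prefix_always(left_cols, right_cols)
print(collision_only)
print(always)
```

With collision-only suffixes, "left_only" and "right_only" keep their original
names, which is what a dplyr user would expect to see.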
I want to make sure I have understood this correctly. For now I have gone
through the code base to see how it is being done. A related but separate
issue is that when we read the output of a join (given that we are using the
ExecPlan and a sink node to get the result out), we still have to provide a
schema, so no matter which affix we use, the output table will have the names
we provide in the schema. This is a bit bothersome: the schema should be
inferred from the input data, while still giving the user an opportunity to
project what is needed. It is not clear to me how this is meant to work. Or is
there a way to do the joins without the ExecPlan by calling the kernel
directly? (Just curious whether we expose this kind of functionality, given
that an advanced user may need only the join kernel and use their own dataflow
model to move data among operators.) Maybe I am not 100% familiar with the
current approach, but I just wanted to make a note about this.
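What I have in mind for the schema could be sketched roughly as follows. This
is plain Python under stated assumptions, not the ExecPlan API: a schema is
modelled as a list of (name, type) pairs, and `infer_join_schema` and its
`project` parameter are hypothetical names. The output schema is derived from
the two input schemas, and the user only supplies names when they want to
project a subset.

```python
from typing import List, Optional, Sequence, Tuple

Field = Tuple[str, str]  # (column name, type name); stand-in for a schema field


def infer_join_schema(
    left: Sequence[Field],
    right: Sequence[Field],
    keys: Sequence[str],
    suffixes: Tuple[str, str] = ("_x", "_y"),
    project: Optional[Sequence[str]] = None,
) -> List[Field]:
    """Derive the join output schema from the inputs, then optionally project."""
    key_set = set(keys)
    collisions = ({n for n, _ in left} & {n for n, _ in right}) - key_set

    fields: List[Field] = []
    for name, typ in left:
        fields.append((name + suffixes[0] if name in collisions else name, typ))
    for name, typ in right:
        if name in key_set:
            continue  # key columns are taken once, from the left input
        fields.append((name + suffixes[1] if name in collisions else name, typ))

    if project is None:
        return fields  # nothing requested: the inferred schema is the output

    by_name = dict(fields)
    missing = [n for n in project if n not in by_name]
    if missing:
        raise KeyError(f"unknown output columns: {missing}")
    return [(n, by_name[n]) for n in project]


left_schema = [("id", "int64"), ("value", "float64")]
right_schema = [("id", "int64"), ("value", "string"), ("flag", "bool")]

full_schema = infer_join_schema(left_schema, right_schema, keys=["id"])
projected = infer_join_schema(
    left_schema, right_schema, keys=["id"], project=["id", "value_y"]
)
print(full_schema)
print(projected)
```

The point of the sketch is only that the names in the result come from the
inputs plus the affix rule, and the projection is an optional step on top.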
> [R] [C++] Handle suffix argument in joins
> -----------------------------------------
>
> Key: ARROW-14679
> URL: https://issues.apache.org/jira/browse/ARROW-14679
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, R
> Reporter: Jonathan Keane
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
> Labels: pull-request-available, query-engine
> Fix For: 7.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> If there is a name collision, we need to do something
> https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't seem to actually be able to apply the prefixes (I'm getting
> errors when trying). I couldn't find any tests of this, so I'm not sure if
> I'm calling this wrong or if it's not working at all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is
> a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to
> provide new names?). In the tests I wrote I've worked around this, but it
> would be nice to be able to match dplyr and allow things other than prefixes.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)