[
https://issues.apache.org/jira/browse/ARROW-14679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17467099#comment-17467099
]
Vibhatha Lakmal Abeykoon edited comment on ARROW-14679 at 12/31/21, 4:53 AM:
-----------------------------------------------------------------------------
" I'm happy to split this into separate issues if that's easier (though I'm not
totally sure that it's necessary), but there are three issues here that we
should resolve:"
> Yes, the subtask was just added to make it neater, but that is very optional
> :)
* "Be able to successfully join two tables with columns that have the same
names (but aren't used as keys)."
** > I think the behavior of pandas.merge is the expected functionality (please
comment, [~westonpace], [~jonkeane]).
* "Be able to only add the unique-name-making affixes to the columns that are
duplicated (if I have two tables with the cols: [id, col_a, col_b] and [id,
col_b, col_c, col_d] and I join them with the key being id, I should get [id,
col_a, col_b.x, col_b.y, col_c, col_d], or the prefix version with x.col_b,
y.col_b if we allow prefixes)."
** > First we need to decide whether to keep prefixes or suffixes. Supporting
both doesn't make sense (in my view) and could make the code confusing to
handle, so we need to decide which option to expose. If we go with suffixes, as
Pandas does, we need to replace the prefixes in the C++ core with suffixes. The
functionality change itself is trivial, but if prefixes are already exposed to
multiple languages, we also have to change those function signatures and test
cases. The alternative is to refactor the schema at the R level right after
getting the join response, renaming columns by string comparison, but that
would be very untidy and hard to maintain.
* "Be able to use suffixes"
> +1 for this. I suggest dropping prefixes if possible, to be consistent with
> the R stack and the Pandas stack, since most users are familiar with suffixes.
Pandas Example
-----------------
{code:python}
import pandas as pd
df_l = pd.DataFrame({"id": [10, 20, 30, 40, 50, 10, 30],
"a": [11, 12, 14, 14, 15, 16, 17],
"b": [1, 2, 3, 4, 5, 6, 7]
})
df_r = pd.DataFrame({"id": [10, 10, 12, 41, 51, 20, 30],
"a": [21, 22, 24, 24, 25, 26, 27],
"c": [1, 2, 3, 4, 5, 6, 7]
})
print(df_l)
print("-------------")
print(df_r)
df_join = df_l.merge(df_r, on="id", how="inner", suffixes=(".x", ".y"))
print(df_join)
"""
Output:
id a.x b a.y c
0 10 11 1 21 1
1 10 11 1 22 2
2 10 16 6 21 1
3 10 16 6 22 2
4 20 12 2 26 6
5 30 14 3 27 7
6 30 17 7 27 7
"""
{code}
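To make the "only rename the duplicated columns" rule discussed above concrete, here is a minimal sketch in plain Python. It is not Arrow's implementation or API: the function name resolve_join_names and its parameters are hypothetical, and it assumes the key columns share the same names on both sides and appear once in the output.

```python
from typing import List, Sequence, Tuple


def resolve_join_names(
    left: Sequence[str],
    right: Sequence[str],
    keys: Sequence[str],
    suffixes: Tuple[str, str] = (".x", ".y"),
) -> List[str]:
    """Hypothetical helper: compute output column names for a join.

    Only non-key columns present on both sides receive a suffix.
    """
    key_set = set(keys)
    left_rest = [c for c in left if c not in key_set]
    right_rest = [c for c in right if c not in key_set]
    # Non-key names that appear on both sides are the collisions.
    collisions = set(left_rest) & set(right_rest)

    # Key columns come first and are emitted once (assumption).
    out = [c for c in left if c in key_set]
    out += [c + suffixes[0] if c in collisions else c for c in left_rest]
    out += [c + suffixes[1] if c in collisions else c for c in right_rest]
    return out


names = resolve_join_names(
    ["id", "col_a", "col_b"],
    ["id", "col_b", "col_c", "col_d"],
    keys=["id"],
)
print(names)
# ['id', 'col_a', 'col_b.x', 'col_b.y', 'col_c', 'col_d']
```

With the column names from the Pandas example (left [id, a, b], right [id, a, c]) the same sketch yields [id, a.x, b, a.y, c], which matches the header of the merge output shown above.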
> [R] [C++] Handle suffix argument in joins
> -----------------------------------------
>
> Key: ARROW-14679
> URL: https://issues.apache.org/jira/browse/ARROW-14679
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, R
> Reporter: Jonathan Keane
> Assignee: Vibhatha Lakmal Abeykoon
> Priority: Major
> Labels: pull-request-available, query-engine
> Fix For: 7.0.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> If there is a name collision, we need to do something
> https://github.com/apache/arrow/blob/a3746040d8a3ddb84bab6c7ca4771b6c120e3444/r/R/dplyr-join.R#L31
> A few notes:
> * arrow doesn't seem to actually be able to apply the prefixes (I'm getting
> errors when trying). I couldn't find any tests of this, so I'm not sure if
> I'm calling this wrong or if it's not working at all.
> * arrow always appends the affixes (whereas dplyr only adds them if there is
> a name collision)
> * arrow only supports prefixes (can we configure this, or ask the clients to
> provide new names?). In the tests I wrote I've worked around this, but it
> would be nice to be able to match dplyr and allow things other than prefixes.