[
https://issues.apache.org/jira/browse/ARROW-15957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17508249#comment-17508249
]
Alessandro Molina commented on ARROW-15957:
-------------------------------------------
FYI, in some cases I don't think using `project` is a viable workaround.
For joins, if suffixes are provided, you only know the names of the columns
after the join operation, so it's fairly hard to build the right
projection (you would have to compute the column collisions manually yourself).
This is especially true since there is no way to do a "Project All" to get _all_ the
resulting columns from the join apart from the duplicated keys.
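
To illustrate the manual collision computation mentioned above, here is a minimal sketch in plain Python. It assumes a suffix is appended only to names that appear on both sides of the join, which may not match what the hash join node actually does. The helper `joined_column_names` is hypothetical and is not part of the Arrow API.

```python
def joined_column_names(left, right, left_suffix="", right_suffix=""):
    """Predict the output column names of a join.

    Assumption: a suffix is appended only to names present on both sides.
    """
    collisions = set(left) & set(right)
    left_out = [n + left_suffix if n in collisions else n for n in left]
    right_out = [n + right_suffix if n in collisions else n for n in right]
    # Left-side columns first, then right-side columns.
    return left_out + right_out


names = joined_column_names(
    ["key", "pay"], ["key", "pay", "extra"], left_suffix="_l", right_suffix="_r"
)
print(names)
```

A caller would need this list before it could write the projection, which is the bookkeeping a "Project All" style option would avoid.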
> [C++] Add option to consolidate key columns in hash join
> --------------------------------------------------------
>
> Key: ARROW-15957
> URL: https://issues.apache.org/jira/browse/ARROW-15957
> Project: Apache Arrow
> Issue Type: New Feature
> Components: C++
> Reporter: Weston Pace
> Priority: Major
>
> Currently the hash join outputs key columns from both sides. On an outer
> join this can help distinguish between a row that matched but had entirely
> null payloads on one side and a row that didn't match on one side.
> However, that distinction is sometimes not very important and many databases
> will simply coalesce the key columns into one. For example, we might get an
> outer join result today that looks like:
> {noformat}
> L_KEY | R_KEY | L_PAY | R_PAY
> 0     | 0     | x     | Y
> NULL  | 1     | NULL  | Z
> 2     | NULL  | A     | NULL
> {noformat}
> Ideally we could specify a "combine key columns" option to get a result that
> looks like:
> {noformat}
> KEY | L_PAY | R_PAY
> 0   | x     | Y
> 1   | NULL  | Z
> 2   | A     | NULL
> {noformat}
> This can be done today with an extra project step, and it isn't likely to
> offer much performance benefit, but from a usability perspective it would be
> nice if users didn't have to do this extra project step.
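
The extra project step described in the issue amounts to coalescing the two key columns into one. A minimal sketch in plain Python, using the example rows from the issue, could look like the following. It assumes rows are represented as dicts with `None` standing in for NULL. The helper `consolidate_keys` and the output name `KEY` are hypothetical and are not part of the Arrow API.

```python
def consolidate_keys(rows, left_key="L_KEY", right_key="R_KEY", out_key="KEY"):
    """Merge the left and right key columns into a single key column."""
    result = []
    for row in rows:
        left_value = row[left_key]
        # Coalesce: take the left key unless it is NULL, else the right key.
        key = left_value if left_value is not None else row[right_key]
        payload = {k: v for k, v in row.items() if k not in (left_key, right_key)}
        result.append({out_key: key, **payload})
    return result


rows = [
    {"L_KEY": 0, "R_KEY": 0, "L_PAY": "x", "R_PAY": "Y"},
    {"L_KEY": None, "R_KEY": 1, "L_PAY": None, "R_PAY": "Z"},
    {"L_KEY": 2, "R_KEY": None, "L_PAY": "A", "R_PAY": None},
]
consolidated = consolidate_keys(rows)
for row in consolidated:
    print(row)
```

Note that this only works when the names of both key columns in the join output are already known, which is the difficulty raised in the comment above when suffixes are involved.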