amol- edited a comment on pull request #12452:
URL: https://github.com/apache/arrow/pull/12452#issuecomment-1064943267
> I would personally prefer to see this comment addressed as well (or at
> least get some thoughts on it):
>
> > You also need to specify the key column for both left and right table
> > separately. While this is certainly the most generic (since it can handle
> > different names in left and right table), I think it could also be nice to give
> > the user the possibility to just give one name (or list of names) in case it is
> > the same in left/right table (for better ergonomics when using this method)
>
I'll add support for omitting the right table keys and suffixing columns in
the output as supported by HashJoinNodeOptions.
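
To make the intended ergonomics concrete, here is a minimal pure-Python sketch of the argument handling described above. It is not the Arrow implementation: `normalize_keys` and `output_names` are hypothetical helper names, and the rule "only suffix names that appear on both sides" is an assumption made for illustration.

```python
def normalize_keys(keys, right_keys=None):
    """Return (left_keys, right_keys) as lists.

    Hypothetical helper: accepts a single name or a list of names, and
    reuses the left keys when the right keys are omitted.
    """
    left = [keys] if isinstance(keys, str) else list(keys)
    if right_keys is None:
        right = list(left)
    elif isinstance(right_keys, str):
        right = [right_keys]
    else:
        right = list(right_keys)
    if len(left) != len(right):
        raise ValueError("left and right key lists must have the same length")
    return left, right


def output_names(left_cols, right_cols, left_suffix="", right_suffix=""):
    """Build output column names, suffixing only the colliding ones.

    Assumption for this sketch: a name is suffixed only when it appears
    in both tables.
    """
    collisions = set(left_cols) & set(right_cols)
    left_out = [c + left_suffix if c in collisions else c for c in left_cols]
    right_out = [c + right_suffix if c in collisions else c for c in right_cols]
    return left_out + right_out
```

With these helpers, `normalize_keys("Key")` yields the same key list for both sides, and a `right_suffix` of `"_t2"` turns a colliding right-hand `Key` column into `Key_t2`.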

> For the join keys columns in the output: you now selected one of the
> columns for most joins, but not for outer join, I think? I am not fully sure if
> we should do something different here for outer join (for example, both pandas
> and dplyr will only have a single key column in the output also in the case of
> an outer join)

That's an interesting point. Personally, I think that for outer joins it
makes a lot of sense to keep both columns. Coalescing the key columns would
lose the information about which table each key came from. I think it's more
reasonable to let users decide whether they want to coalesce outer join keys,
especially given that the coalesce operation would add a cost, as we don't
provide it in joins out of the box.
For example:
```
Key | Key_t2 | Other
1   | null   | 55    <---- Obvious that the value comes from Table1 and not Table2
```
VS
```
Key | Other
1 | 55 <---- Where did "1" come from?
```
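
The trade-off can be sketched in plain Python, with tables modelled as lists of dicts. This is only an illustration of the semantics, not Arrow code: `full_outer_join` and `coalesce_keys` are hypothetical names, the `_t2` suffix is taken from the example above, and non-key columns are assumed not to collide.

```python
def full_outer_join(left, right, key, right_suffix="_t2"):
    """Full outer join that keeps both key columns in the output."""
    right_key = key + right_suffix
    left_cols = {c for row in left for c in row} | {key}
    right_cols = {c for row in right for c in row if c != key}

    # Index the right table by key value.
    right_by_key = {}
    for row in right:
        right_by_key.setdefault(row[key], []).append(row)

    out, matched = [], set()
    for lrow in left:
        matches = right_by_key.get(lrow[key], [])
        if matches:
            matched.add(lrow[key])
            for rrow in matches:
                merged = dict(lrow)
                merged[right_key] = rrow[key]
                merged.update({c: rrow.get(c) for c in right_cols})
                out.append(merged)
        else:
            # Left-only row: the right key and right columns are null.
            merged = dict(lrow)
            merged[right_key] = None
            merged.update({c: None for c in right_cols})
            out.append(merged)

    # Right-only rows: the left key and left columns are null.
    for rrow in right:
        if rrow[key] not in matched:
            merged = {c: None for c in left_cols}
            merged[right_key] = rrow[key]
            merged.update({c: rrow.get(c) for c in right_cols})
            out.append(merged)
    return out


def coalesce_keys(rows, key, right_key):
    """Optional extra pass: merge the two key columns into one."""
    out = []
    for row in rows:
        row = dict(row)
        right_value = row.pop(right_key)
        if row[key] is None:
            row[key] = right_value
        out.append(row)
    return out


table1 = [{"Key": 1, "Other": 55}]
table2 = [{"Key": 2, "Extra": "x"}]
joined = full_outer_join(table1, table2, "Key")
coalesced = coalesce_keys(joined, "Key", "Key_t2")
```

In `joined`, the row for key `1` has `Key_t2` set to `None`, so its origin is visible. After `coalesce_keys`, that information is gone, and the merge costs an extra pass over the rows.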