jorisvandenbossche commented on pull request #12452:
URL: https://github.com/apache/arrow/pull/12452#issuecomment-1069100228


   Thanks for the updates! 
   (small workflow request: could you keep new changes in a new commit instead 
of squashing it into a previous one? That makes it much easier to see what you 
changed since the last review)
   
   > That's an interesting point. Personally I think that for outer joins it 
makes a lot of sense to have both columns. Coalescing the key columns would 
make the information about from which table the key comes from getting lost. I 
think it's more reasonable to let users decide if they want to coalesce outer 
join keys or not, especially given that the coalesce operation would add a cost 
as we don't provide it in joins out of the box.
   
   It indeed gives more information (although you will also often see it from 
other columns having all nulls if the key was not present). But there is also 
the expectation / usability for people coming from libraries like pandas or 
dplyr. 
   
   Now, I was looking a bit at some other libraries how they handle this. So 
pandas and dplyr combine the both join columns in a single column. I was 
checking ibis, but I _think_ it might depend on the backend you are using in 
that case. You need to actively select the columns you want to retain after the 
join (so you could say that by default they retain both join columns). With the 
pandas backend though, it does coalesce nulls in the left key column with 
values from the right column. But eg the sqlite backend does not do that for a 
"right" join (this backend doesn't support an outer join).
   
   Duckdb seems to preserve both left/right join columns for inner/left join 
(and doesn't support an outer join). 
   Polars follows the pandas / dplyr behaviour. 
   
   In general, I think the behaviour depends on whether the library is SQL 
"based" (wrapping sql, or using mostly sql concepts) or not (where dubck and 
ibis fall into that group). Since in SQL, you always have to explicitly select 
columns as well (a project) while doing a join. 
   
   Some of my explorations: 
https://notebooksharing.space/view/902129774575d83b4e3096d4a7c1eb26b8e830842096d3dc01a600d220848d99#displayOptions=
   
   > IMHO, it would be more consistent to get two key columns for all joins.
   
   That's certainly the most consistent, but also not always very practical 
IMO. For example, we can't easily deal with duplicate column names. As an 
example, if you would do a `group_by` after the `join`, you get errors about 
duplicate columns names (then you can of course use the `suffix` option, but 
that's also not super convenient).
   
   
   
   
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to