jorisvandenbossche commented on pull request #12452: URL: https://github.com/apache/arrow/pull/12452#issuecomment-1069100228
Thanks for the updates! (small workflow request: could you keep new changes in a new commit instead of squashing it into a previous one? That makes it much easier to see what you changed since the last review) > That's an interesting point. Personally I think that for outer joins it makes a lot of sense to have both columns. Coalescing the key columns would make the information about from which table the key comes from getting lost. I think it's more reasonable to let users decide if they want to coalesce outer join keys or not, especially given that the coalesce operation would add a cost as we don't provide it in joins out of the box. It indeed gives more information (although you will also often see it from other columns having all nulls if the key was not present). But there is also the expectation / usability for people coming from libraries like pandas or dplyr. Now, I was looking a bit at some other libraries how they handle this. So pandas and dplyr combine the both join columns in a single column. I was checking ibis, but I _think_ it might depend on the backend you are using in that case. You need to actively select the columns you want to retain after the join (so you could say that by default they retain both join columns). With the pandas backend though, it does coalesce nulls in the left key column with values from the right column. But eg the sqlite backend does not do that for a "right" join (this backend doesn't support an outer join). Duckdb seems to preserve both left/right join columns for inner/left join (and doesn't support an outer join). Polars follows the pandas / dplyr behaviour. In general, I think the behaviour depends on whether the library is SQL "based" (wrapping sql, or using mostly sql concepts) or not (where dubck and ibis fall into that group). Since in SQL, you always have to explicitly select columns as well (a project) while doing a join. Some of my explorations: https://notebooksharing.space/view/902129774575d83b4e3096d4a7c1eb26b8e830842096d3dc01a600d220848d99#displayOptions= > IMHO, it would be more consistent to get two key columns for all joins. That's certainly the most consistent, but also not always very practical IMO. For example, we can't easily deal with duplicate column names. As an example, if you would do a `group_by` after the `join`, you get errors about duplicate columns names (then you can of course use the `suffix` option, but that's also not super convenient). -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
