[GitHub] [arrow] jorisvandenbossche commented on pull request #12452: ARROW-14292: [C++][Python] Join foundation for Tables

GitBox Wed, 16 Mar 2022 06:00:57 -0700


jorisvandenbossche commented on pull request #12452:
URL: https://github.com/apache/arrow/pull/12452#issuecomment-1069100228

Thanks for the updates!
(small workflow request: could you keep new changes in a new commit instead
of squashing it into a previous one? That makes it much easier to see what you
changed since the last review)

> That's an interesting point. Personally I think that for outer joins it
makes a lot of sense to have both columns. Coalescing the key columns would
make the information about from which table the key comes from getting lost. I
think it's more reasonable to let users decide if they want to coalesce outer
join keys or not, especially given that the coalesce operation would add a cost
as we don't provide it in joins out of the box.

It indeed gives more information (although you will also often see it from
other columns having all nulls if the key was not present). But there is also
the expectation / usability for people coming from libraries like pandas or
dplyr.

Now, I was looking a bit at some other libraries how they handle this. So
pandas and dplyr combine the both join columns in a single column. I was
checking ibis, but I _think_ it might depend on the backend you are using in
that case. You need to actively select the columns you want to retain after the
join (so you could say that by default they retain both join columns). With the
pandas backend though, it does coalesce nulls in the left key column with
values from the right column. But eg the sqlite backend does not do that for a
"right" join (this backend doesn't support an outer join).

Duckdb seems to preserve both left/right join columns for inner/left join
(and doesn't support an outer join).
Polars follows the pandas / dplyr behaviour.

In general, I think the behaviour depends on whether the library is SQL
"based" (wrapping sql, or using mostly sql concepts) or not (where dubck and
ibis fall into that group). Since in SQL, you always have to explicitly select
columns as well (a project) while doing a join.

Some of my explorations:
https://notebooksharing.space/view/902129774575d83b4e3096d4a7c1eb26b8e830842096d3dc01a600d220848d99#displayOptions=

> IMHO, it would be more consistent to get two key columns for all joins.

That's certainly the most consistent, but also not always very practical
IMO. For example, we can't easily deal with duplicate column names. As an
example, if you would do a `group_by` after the `join`, you get errors about
duplicate columns names (then you can of course use the `suffix` option, but
that's also not super convenient).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] jorisvandenbossche commented on pull request #12452: ARROW-14292: [C++][Python] Join foundation for Tables

Reply via email to