zanmato1984 commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2920686680
Thank you @xingyu-long for contributing this!
I'd first address your concern of:
> it seems we didn't return empty, instead, we return the `right outer`? it
seems the join type takes higher priority than filter operation for the final
result?
>
> btw, it seems fine with inner join type.
Yes, this is expected by SQL semantic. And this is also the difference
between you put an expression within `ON` condition of `JOIN` and that within
`WHERE` clause, e.g.,
```FROM t1 LEFT JOIN t2 ON t1.value = x and t2.value = y```
does not equal to
```FROM t1 LEFT JOIN t2 ON true WHERE t1.value = x and t2.value = y```
(They are equivalent ONLY for inner joins.)
This is quite understandable because otherwise you wouldn't need most of
join types except inner :)
Conceptually, all subexpressions in `ON` condition are equally contributing
to determine if two rows from each side are a "match" (the whole expression
evaluates `true`) or a "non-match" (the whole expression evaluates `null` or
`false`). It's just that in practice, most query engines do hash join that
requires at least one equal condition with columns from both sides, and for
such conditions the columns are used as join "key"s (in your case the join key
is implicitly specified by columns with common name). The rest of the
expression is normally treated as so-called "residual filter" (this is what
your PR added). Now back to the "conceptually", depending on the join type
(inner/left outer/right outer/etc), rows are then processed differently. Take
inner and left outer as two examples:
* inner join will keep all the columns from both sides for a match, and
discard the entire row for a non-match - this is the same as if you do the
filter on the table scan first than apply join.
* left outer join will always keep the left side columns, and keep the right
side columns as well for a match, or discard the right side columns (by filling
`null`s) for a non-match (but this row is still emitted in the join result).
> for example, let's assume that we have two tables which have some common
fields (`id` and `name`), and we'd like to join them by id and then filter name
with certain pattern. so without exposing this API to python, we probably need
to maintain a big intermediate state of temp join and then apply the filter on
top of it.
Yes this is necessary for preserving the SQL-like join semantic - as long as
you write the filter in the `ON` condition. Again, the filter support you are
adding is the "residual filter" (the subexpressions other than join keys in
`ON` condition), not a regular "filter".
> but if we can apply the filter on both tables first before we joining two
tables, it would be more efficient? that's why I'd like to confirm what's the
expected behavior for this filter in c++ implementation.
In this case you can just do the filter ahead of join, e.g.,
```
t2_filtered = t2.filter(pc.equal(pc.field("n_legs"), 200)
t1.join(t2_filtered, 'id', join_type="right outer")
```
As long as it is what you needed.
> 1. it seems filter cannot apply for both side, i.e same field for both
table/schema
This is an independent problem. Because join is concatenating columns from
both sides, so it is possible that the result table contains columns with the
same name. If so, you won't be able to further reference a such column without
ambiguity. You can specify `output_suffix_for_left/right` to append unique
identifiers to their column names, so that you can disambiguate them.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]