Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

via GitHub Thu, 29 May 2025 15:03:16 -0700


zanmato1984 commented on PR #46566:
URL: https://github.com/apache/arrow/pull/46566#issuecomment-2920686680


   Thank you @xingyu-long for contributing this!
   
   I'd first address your concern of:
   > it seems we didn't return empty, instead, we return the `right outer`? it 
seems the join type takes higher priority than filter operation for the final 
result?
   > 
   > btw, it seems fine with inner join type.
   
   Yes, this is expected by SQL semantic. And this is also the difference 
between you put an expression within `ON` condition of `JOIN` and that within 
`WHERE` clause, e.g.,
   ```FROM t1 LEFT JOIN t2 ON t1.value = x and t2.value = y```
   does not equal to
   ```FROM t1 LEFT JOIN t2 ON true WHERE t1.value = x and t2.value = y```
   (They are equivalent ONLY for inner joins.)
   This is quite understandable because otherwise you wouldn't need most of 
join types except inner :)
   
   Conceptually, all subexpressions in `ON` condition are equally contributing 
to determine if two rows from each side are a "match" (the whole expression 
evaluates `true`) or a "non-match" (the whole expression evaluates `null` or 
`false`). It's just that in practice, most query engines do hash join that 
requires at least one equal condition with columns from both sides, and for 
such conditions the columns are used as join "key"s (in your case the join key 
is implicitly specified by columns with common name). The rest of the 
expression is normally treated as so-called "residual filter" (this is what 
your PR added). Now back to the "conceptually", depending on the join type 
(inner/left outer/right outer/etc), rows are then processed differently. Take 
inner and left outer as two examples:
   * inner join will keep all the columns from both sides for a match, and 
discard the entire row for a non-match - this is the same as if you do the 
filter on the table scan first than apply join.
   * left outer join will always keep the left side columns, and keep the right 
side columns as well for a match, or discard the right side columns (by filling 
`null`s) for a non-match (but this row is still emitted in the join result).
   
   > for example, let's assume that we have two tables which have some common 
fields (`id` and `name`), and we'd like to join them by id and then filter name 
with certain pattern. so without exposing this API to python, we probably need 
to maintain a big intermediate state of temp join and then apply the filter on 
top of it.
   
   Yes this is necessary for preserving the SQL-like join semantic - as long as 
you write the filter in the `ON` condition. Again, the filter support you are 
adding is the "residual filter" (the subexpressions other than join keys in 
`ON` condition), not a regular "filter".
   
   > but if we can apply the filter on both tables first before we joining two 
tables, it would be more efficient? that's why I'd like to confirm what's the 
expected behavior for this filter in c++ implementation.
   
   In this case you can just do the filter ahead of join, e.g.,
   ```
   t2_filtered = t2.filter(pc.equal(pc.field("n_legs"), 200)
   t1.join(t2_filtered, 'id', join_type="right outer")
   ```
   As long as it is what you needed.
   
   > 1. it seems filter cannot apply for both side, i.e same field for both 
table/schema
   
   This is an independent problem. Because join is concatenating columns from 
both sides, so it is possible that the result table contains columns with the 
same name. If so, you won't be able to further reference a such column without 
ambiguity. You can specify `output_suffix_for_left/right` to append unique 
identifiers to their column names, so that you can disambiguate them.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

Re: [PR] GH-46572: [Python] expose filter option to python for join [arrow]

Reply via email to