zanmato1984 commented on PR #46566: URL: https://github.com/apache/arrow/pull/46566#issuecomment-2920686680
Thank you @xingyu-long for contributing this! I'd first address your concern of: > it seems we didn't return empty, instead, we return the `right outer`? it seems the join type takes higher priority than filter operation for the final result? > > btw, it seems fine with inner join type. Yes, this is expected by SQL semantic. And this is also the difference between you put an expression within `ON` condition of `JOIN` and that within `WHERE` clause, e.g., ```FROM t1 LEFT JOIN t2 ON t1.value = x and t2.value = y``` does not equal to ```FROM t1 LEFT JOIN t2 ON true WHERE t1.value = x and t2.value = y``` (They are equivalent ONLY for inner joins.) This is quite understandable because otherwise you wouldn't need most of join types except inner :) Conceptually, all subexpressions in `ON` condition are equally contributing to determine if two rows from each side are a "match" (the whole expression evaluates `true`) or a "non-match" (the whole expression evaluates `null` or `false`). It's just that in practice, most query engines do hash join that requires at least one equal condition with columns from both sides, and for such conditions the columns are used as join "key"s (in your case the join key is implicitly specified by columns with common name). The rest of the expression is normally treated as so-called "residual filter" (this is what your PR added). Now back to the "conceptually", depending on the join type (inner/left outer/right outer/etc), rows are then processed differently. Take inner and left outer as two examples: * inner join will keep all the columns from both sides for a match, and discard the entire row for a non-match - this is the same as if you do the filter on the table scan first than apply join. * left outer join will always keep the left side columns, and keep the right side columns as well for a match, or discard the right side columns (by filling `null`s) for a non-match (but this row is still emitted in the join result). > for example, let's assume that we have two tables which have some common fields (`id` and `name`), and we'd like to join them by id and then filter name with certain pattern. so without exposing this API to python, we probably need to maintain a big intermediate state of temp join and then apply the filter on top of it. Yes this is necessary for preserving the SQL-like join semantic - as long as you write the filter in the `ON` condition. Again, the filter support you are adding is the "residual filter" (the subexpressions other than join keys in `ON` condition), not a regular "filter". > but if we can apply the filter on both tables first before we joining two tables, it would be more efficient? that's why I'd like to confirm what's the expected behavior for this filter in c++ implementation. In this case you can just do the filter ahead of join, e.g., ``` t2_filtered = t2.filter(pc.equal(pc.field("n_legs"), 200) t1.join(t2_filtered, 'id', join_type="right outer") ``` As long as it is what you needed. > 1. it seems filter cannot apply for both side, i.e same field for both table/schema This is an independent problem. Because join is concatenating columns from both sides, so it is possible that the result table contains columns with the same name. If so, you won't be able to further reference a such column without ambiguity. You can specify `output_suffix_for_left/right` to append unique identifiers to their column names, so that you can disambiguate them. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: github-unsubscr...@arrow.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org