alamb commented on code in PR #7612:
URL: https://github.com/apache/arrow-datafusion/pull/7612#discussion_r1336355393
##########
datafusion/substrait/src/logical_plan/consumer.rs:
##########
@@ -331,6 +342,12 @@ pub async fn from_substrait_rel(
}
}
Some(RelType::Join(join)) => {
+ if join.post_join_filter.is_some() {
Review Comment:
👍
##########
datafusion/substrait/src/logical_plan/producer.rs:
##########
@@ -277,30 +277,47 @@ pub fn to_substrait_rel(
// parse filter if exists
let in_join_schema = join.left.schema().join(join.right.schema())?;
let join_filter = match &join.filter {
- Some(filter) => Some(Box::new(to_substrait_rex(
+ Some(filter) => Some(to_substrait_rex(
filter,
&Arc::new(in_join_schema),
0,
extension_info,
- )?)),
+ )?),
None => None,
};
+
// map the left and right columns to binary expressions in the
form `l = r`
// build a single expression for the ON condition, such as `l.a =
r.a AND l.b = r.b`
let eq_op = if join.null_equals_null {
Operator::IsNotDistinctFrom
} else {
Operator::Eq
};
-
- let join_expr = to_substrait_join_expr(
+ let join_on = to_substrait_join_expr(
&join.on,
eq_op,
join.left.schema(),
join.right.schema(),
extension_info,
- )?
- .map(Box::new);
+ )?;
+
+ // create conjunction between `join_on` and `join_filter` to embed
all join conditions,
+ // whether equal or non-equal in a single expression
+ let join_expr = match &join_on {
Review Comment:
Perhaps you could use `conjunction` here and simplify the code
https://docs.rs/datafusion/latest/datafusion/optimizer/utils/fn.conjunction.html
##########
datafusion/substrait/src/logical_plan/consumer.rs:
##########
@@ -331,6 +342,12 @@ pub async fn from_substrait_rel(
}
}
Some(RelType::Join(join)) => {
+ if join.post_join_filter.is_some() {
Review Comment:
👍
##########
datafusion/substrait/src/logical_plan/consumer.rs:
##########
@@ -351,41 +362,62 @@ pub async fn from_substrait_rel(
None => None,
};
// If join expression exists, parse the `on` condition expression,
build join and return
- // Otherwise, build join with koin filter, without join keys
+ // Otherwise, build join with only the filter, without join keys
match &join.expression.as_ref() {
Some(expr) => {
let on =
from_substrait_rex(expr, &in_join_schema,
extensions).await?;
let predicates = split_conjunction(&on);
// TODO: collect only one null_eq_null
- let join_exprs: Vec<(Column, Column, bool)> = predicates
- .iter()
- .map(|p| match p {
+ // The predicates can contain both equal and non-equal ops.
Review Comment:
I looked into the substrait spec, and it doesn't really talk about what
semantics of the `post_join_filter` is:
https://substrait.io/relations/logical_relations/#join-operation
https://github.com/search?q=repo%3Asubstrait-io%2Fsubstrait%20post_join_filter&type=code
There is a subtle distinction between non equality filters applied *during*
the join (in the `ON` clause) and applied post join for non-INNER joins: for
non inner joins the filters during the join don't filter out input rows (they
still come out, just with NULL matches)
> So the producer should only generate plans with None as post_join_filer.
This makes sense to me
For the consumer, there is already code in DataFusion that breaks up an
arbitrary `Expr` into equality predicates and others. This is how the SQL
frontend creates a Join (a single expr):
https://github.com/apache/arrow-datafusion/blob/a514b6752b063a5a3006aa114297520a933339b0/datafusion/sql/src/relation/join.rs#L134-L141
I think we could do the same here in the subtrait consumer which would be
much simpler, and would let the normal DataFusion optimization machinery work.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]