xiedeyantu commented on code in PR #22370:
URL: https://github.com/apache/datafusion/pull/22370#discussion_r3281321403


##########
datafusion/optimizer/src/expand_join_or_predicate.rs:
##########
@@ -0,0 +1,174 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [`ExpandJoinOrPredicate`] rewrites inner joins with OR filters into a 
UNION ALL
+//! of mutually exclusive hashjoin-capable inner joins.
+
+use crate::optimizer::ApplyOrder;
+use crate::{OptimizerConfig, OptimizerRule};
+use std::sync::Arc;
+
+use datafusion_common::tree_node::Transformed;
+use datafusion_common::Result;
+use datafusion_expr::logical_plan::{Join, LogicalPlan, Projection, Union};
+use datafusion_expr::utils::{can_hash, find_valid_equijoin_key_pair, 
split_binary_owned, split_conjunction_owned};
+use datafusion_expr::{Expr, ExprSchemable, JoinType, Operator};
+
+#[derive(Default, Debug)]
+pub struct ExpandJoinOrPredicate;
+
+impl ExpandJoinOrPredicate {

Review Comment:
   @2010YOUY01 Thank you so much for your detailed explanation of this logic! I 
think it's a very good idea, and for inner joins, this implementation is 
optimal, completing the entire logic directly at the physical execution layer. 
Regarding my current proposal, I support this implementation. However, since 
I'm not entirely clear on the execution-level logic, implementing it using 
DisjointHashJoinExec might take some time.
   
   Actually, there's 2 PRs(https://github.com/apache/calcite/pull/4300, 
https://github.com/apache/calcite/pull/4315) I submitted to Calcite, This rule 
file 
(https://github.com/apache/calcite/blob/main/core/src/main/java/org/apache/calcite/rel/rules/JoinExpandOrToUnionRule.java)
 will be more intuitive. where I implemented inner/left/right/full/anti joins 
(I didn't implement semi-join because its semantics are not easily split into 
multiple mutually exclusive joins). Their methods for splitting multiple join 
branches differ (refer to the comments in the connection code above). This 
isn't easily implemented using DisjointHashJoinExec; for example, left joins 
would be split into inner and anti joins. I limited the PR to inner joins 
because I wanted to implement the first step first, as the performance 
improvement was significant when testing joins of two tables (1000+ rows). I 
can't construct SQL to test other scenarios yet, as they all involve anti 
joins. If everyone accepts this soluti
 on, I will expand it to support more join types later. This is why I 
implemented it as a separate rule. There's another reason, which I also 
mentioned above: when the table only has 10 rows of data, the overhead becomes 
apparent, slowing down this optimization. Therefore, parameters or statistical 
information can help decide whether to rewrite it.
   
   Regarding your first question, I think we can achieve scan reuse at either 
the logical or physical layer; we don't need to worry too much about rewriting 
the logic.
   
   Regarding the second question, I think it might be difficult for us to do, 
perhaps because my understanding of DataFusion's execution layer is still 
limited.
   
   Regarding the third question, similar to the first, the complexity of the 
plan is not necessarily directly related to the actual execution logic or 
performance.
   
   This is just my personal opinion; please correct me if I'm wrong. Thank you 
very much for participating in the discussion.
   
   If we're only considering inner joins (without expanding to other join 
types), I personally like the second solution @Dandandan  mentioned. 
@2010YOUY01 If you've already implemented it, then I'll close this PR. If 
you're interested in further expansions of this PR, I can implement these 
capabilities through multiple PRs. Looking forward to your reply!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to