xiedeyantu commented on code in PR #22370:
URL: https://github.com/apache/datafusion/pull/22370#discussion_r3281321403


##########
datafusion/optimizer/src/expand_join_or_predicate.rs:
##########
@@ -0,0 +1,174 @@
+// Licensed to the Apache Software Foundation (ASF) under one
+// or more contributor license agreements.  See the NOTICE file
+// distributed with this work for additional information
+// regarding copyright ownership.  The ASF licenses this file
+// to you under the Apache License, Version 2.0 (the
+// "License"); you may not use this file except in compliance
+// with the License.  You may obtain a copy of the License at
+//
+//   http://www.apache.org/licenses/LICENSE-2.0
+//
+// Unless required by applicable law or agreed to in writing,
+// software distributed under the License is distributed on an
+// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+// KIND, either express or implied.  See the License for the
+// specific language governing permissions and limitations
+// under the License.
+
+//! [`ExpandJoinOrPredicate`] rewrites inner joins with OR filters into a 
UNION ALL
+//! of mutually exclusive hashjoin-capable inner joins.
+
+use crate::optimizer::ApplyOrder;
+use crate::{OptimizerConfig, OptimizerRule};
+use std::sync::Arc;
+
+use datafusion_common::tree_node::Transformed;
+use datafusion_common::Result;
+use datafusion_expr::logical_plan::{Join, LogicalPlan, Projection, Union};
+use datafusion_expr::utils::{can_hash, find_valid_equijoin_key_pair, 
split_binary_owned, split_conjunction_owned};
+use datafusion_expr::{Expr, ExprSchemable, JoinType, Operator};
+
+#[derive(Default, Debug)]
+pub struct ExpandJoinOrPredicate;
+
+impl ExpandJoinOrPredicate {

Review Comment:
   @2010YOUY01 Thank you so much for your detailed explanation of this logic! I 
think it's a very good idea, and for inner joins, this implementation is 
optimal, completing the entire logic directly at the physical execution layer. 
Regarding my current proposal, I support this implementation. However, since 
I'm not entirely clear on the execution-level logic, implementing it using 
DisjointHashJoinExec might take some time.
   
   Actually, there's 2 PRs(https://github.com/apache/calcite/pull/4300, 
https://github.com/apache/calcite/pull/4315) I submitted to Calcite, where I 
implemented inner/left/right/full/anti joins (I didn't implement semi-join 
because its semantics are not easily split into multiple mutually exclusive 
joins). Their methods for splitting multiple join branches differ (refer to the 
comments in the connection code above). This isn't easily implemented using 
DisjointHashJoinExec; for example, left joins would be split into inner and 
anti joins. I limited the PR to inner joins because I wanted to implement the 
first step first, as the performance improvement was significant when testing 
joins of two tables (1000+ rows). I can't construct SQL to test other scenarios 
yet, as they all involve anti joins. If everyone accepts this solution, I will 
expand it to support more join types later. This is why I implemented it as a 
separate rule. There's another reason, which I also mentioned above: when
  the table only has 10 rows of data, the overhead becomes apparent, slowing 
down this optimization. Therefore, parameters or statistical information can 
help decide whether to rewrite it.
   
   Regarding your first question, I think we can achieve scan reuse at either 
the logical or physical layer; we don't need to worry too much about rewriting 
the logic.
   
   Regarding the second question, I think it might be difficult for us to do, 
perhaps because my understanding of DataFusion's execution layer is still 
limited.
   
   Regarding the third question, similar to the first, the complexity of the 
plan is not necessarily directly related to the actual execution logic or 
performance.
   
   This is just my personal opinion; please correct me if I'm wrong. Thank you 
very much for participating in the discussion.
   
   If we're only considering inner joins (without expanding to other join 
types), I personally like the first solution @Dandandan  mentioned. @2010YOUY01 
If you've already implemented it, then I'll close this PR. If you're interested 
in further expansions of this PR, I can implement these capabilities through 
multiple PRs. Looking forward to your reply!



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to