Hi,

Since we have commited CALCITE-2969, and in the next release of 1.21, we will 
deprecate EquiJoin and make enumerable hash join support non-equi join 
condition, the optimization door is open to us.

Currently Calcite often generates complicated and inefficient plan for boolean 
context correlated subquery (putting value context subquery aside, which is 
more complex during unnesting). e.g.
https://issues.apache.org/jira/browse/CALCITE-2857
https://issues.apache.org/jira/browse/CALCITE-2874

There are many more cases if you want to try and find out. Some underlying 
reasons are:
1. SemJoin doesn't support non-equi condition
2. Correlate comes with Aggregate in many cases

Sometimes, SubqueryRemoveRule generates Correlate with Aggregate, sometimes 
generates Join with Aggregate. Then we use SemiJoinRule to match Join(Rel, Agg) 
pattern to see whether we can transform it to semijoin or not. This is counter 
intuitive, because we already know it will be a SemiJoin or AntiSemi when we 
generate Correlate for the subquery. In current way, we may miss the chance to 
get back SemiJoin for it, and we almost lose the opportunity to do IndexScan if 
there is index available.

In addition, I see 2 rules JoinAddRedundantSemiJoinRule and SemiJoinRemoveRule 
seem to provide the chance to do index scan. But through the test cases related 
with the 2 rules, I don't see any chance for a better plan with them and they 
don't seem to be able to generate plan with index scan. I doubt whether they 
are used at all. Can someone shed some light on this and give us some examples, 
if I get it wrong?

The transformation flow for boolean context sub-query in my head is:
sub-query -> Correate(Semi/AntiSemi) without Aggregate (SubqueryRemoveRule)
Correlate(Semi/AntiSemi) -> NestedLoopJoin (correlate implementation rule)
Correlate(Semi/AntiSemi) -> IndexedNestedLoopJoin (index apply rule)
Correlate(Semi/AntiSemi) -> SemiJoin/AntiSemiJoin (Semi Correlate to Semi Join 
rule)
SemiJoin -> InnerJoin(rel, Agg) (SemiJoin to Join rule)
InnerJoin(rel, Agg) ->  InnerJoin(Agg, rel) (Join reorder rule)
SemiJoin/InnerJoin -> HashJoin/MergeJoin/NLJ (Join implementation rule)

This is a long shot, and will involve tons of changes.
Any thoughts?

- Haisheng Yuan

Reply via email to