It seems that the discusion has somehow converged (at least to the major points). I created CALCITE-2969, for whoever decides to tackle this issue.
[1] https://issues.apache.org/jira/browse/CALCITE-2969 Στις Δευ, 25 Μαρ 2019 στις 8:57 μ.μ., ο/η Julian Hyde <[email protected]> έγραψε: > Generally +1 on what Haisheng says. Specifically: > > I like the idea of renaming EnumerableCorrelate, and making it not extend > Correlate. I would choose EnumerableNestedLoopJoin rather than > EnumerableNestLoopJoin. > > Shifting from LogicalCorrelate to LogicalApply is worth considering. > LogicalApply is similar to the “map” operator in functional programming, or > “selectMany” in LINQ, so is very well-behaved and powerful - a good > abstraction. > > Regarding SemiJoin and EquiJoin. Maybe we could deprecate them, or maybe > we could convert them to interfaces. I’ll leave that decision to whoever > actually writes the code. If we moved a few things to interfaces (including > JoinLike I mentioned earlier) maybe we’d get out of the gridlock caused by > the type hierarchy. > > Regarding when to decorrelate. Decorrelation during sql-to-rel is legacy. > We now prefer to decorrelate using rules, in RelNode-land. There may be > bugs in the legacy decorrelation and we do not aggressively fix them. We > can even start to remove functionality if it helps us make > SqlToRelConverter simpler. > > Julian > > > > > > > > > > On Mar 25, 2019, at 12:23 PM, Haisheng Yuan <[email protected]> > wrote: > > > > I agree with Stamatis that JoinRelType should have values: > > Inner, Left (Outer), Full (Outer), Semi, Anti. > > The option of right outer join is not necessary, because we can flip the > inner/outer to left outer join. > > > > SemiJoin and EquiJoin can be deprecated. > > > > EnumerableCorrelate is confusing, correlate is a logical concept, better > to rename it to EnumerableNestLoopJoin. SemiJoin can be implemented by > nestloop, hashjoin or mergejoin. > > I don’t see the necessity of having a separate physical operator > EnumerableSemiJoin. > > But these are minor naming issue. > > > > Regarding the LogicalCorrelate, I view it as a kind of operator similar > to LogicalApply [1], which is > > the logical operator in Microsoft SQL Server and Greenplum Orca > optimizer. Both uses LogicalApply > > operator to represent the correlated join that inner has reference to > the outer variable. The apply may > > have different type: cross apply (or inner apply), outer apply, semi > apply, anti-semi apply. They are > > just subset of join types, maybe it is why it is acciociated with > JoinRelType, or reuse. The main > > difference between Correlate (or Apply) and Join is (logically > speaking): In Correlate, inner has > > reference to outer. In Join, inner doesn’t reference outer. NestLoopJoin > can implement both. > > > > With optimizer transformation rules, Correlate (or Apply) can be > transformed into a Join, or a Join > > is transfomed into a Correlate (Or Apply), in case there is an index can > be used in inner relation. > > > > What I am not comfortable with is: > > In SQL Server and GPDB Orca optimzier, Sql is translated into logical > relation as it should be ( > > keep subquery as it is), then use all kinds of apply rules to unnest > subqueries based on cost model, > > which seems reasonable to me. > > But in Calcite, we can not only decorrelate in SqlToRel stage, but also > can do it in SubqueryRemoveRule. > > Should we unify them all in the rules and keep SqlToRelConverter simple? > > > > Thanks ~ > > Haisheng Yuan > > ------------------------------------------------------------------ > > 发件人:Stamatis Zampetakis<[email protected]> > > 日 期:2019年03月23日 07:31:35 > > 收件人:<[email protected]> > > 主 题:Re: Join, SemiJoin, Correlate > > > > Since we are discussing this topic I thought it would be could to bring > > back > > to the surface a similar discussion [1] that has been done a few years > ago > > in this list. > > > > I am leaning towards option 3 where JoinRelType has all necessary values: > > Inner, Left, Semi, Anti, and Full. > > With these changes it seems we could remove (deprecate) also SemiJoin, > and > > EquiJoin. > > > > On the physical level we could have: > > 1. EnumerableCorrelate or EnumerableNestedLoopJoin; > > 2. EnumerableMergeJoin; > > 3. EnumerableHashJoin (currently EnumerableJoin) > > > > and for the above we could pass the JoinRelType throwing an exception > when > > the specific algorithm cannot be used to implement a specific type of > join. > > > > EnumerableSemiJoin and EnumerableThetaJoin could also be removed and > > covered from the above I think. > > > > Regarding Correlate and LogicalCorrelate, I am not sure what should we > do. > > Associating the JoinRelType with it does not seem right, and making > > Correlate also a Join is not very attractive either. > > > > Best, > > Stamatis > > > > [1] > > > http://mail-archives.apache.org/mod_mbox/calcite-dev/201411.mbox/%3CCAB%3DJe-H7AWEHbKzjrRHd-YcZgkgWzFORALrz_mMc2k7WDdj54Q%40mail.gmail.com%3E > > > > > > Στις Πέμ, 21 Μαρ 2019 στις 10:35 μ.μ., ο/η Walaa Eldin Moustafa < > > [email protected]> έγραψε: > > > >>> Is your concern with how we have structured the class hierarchy? Or > just > >> how we describe Correlate in the documentation? > >> > >> My concern is with both, but mainly the former. > >> > >>> I do agree that Correlate and nested loops joins are not the same (one > >> is logical, the other physical). However, they have a lot in common, in > >> particular the fact that one input sets variables and the input reads > those > >> variables. > >> > >> I think this commonality describes how the query is written, but not > >> necessarily what it is logically equivalent to. It also describes the > >> "how", and not necessarily the "what". I would say logical > >> representations should be concerned with the "what" part. > >> > >>> I can’t think of any way to represent a nested loops join (e.g. for > each > >> department, find all employees in that department) that does not use > >> variables to tie together the two inputs. And therefore I am happy with > the > >> fact that our Java implementation of nested-loops join is ‘class > >> EnumerableCorrelate extends Correlate’. > >> > >> That is correct. The two variables are required. At the logical level > >> they are mapped to the Correlate variables, or the Join keys after > >> decorrelation. After going to physical, we can only have join keys. > >> One of the keys can be the basis for the outer loop and the other for > >> the inner loop if needed. That is true for both Correlate and Join > >> operators. Both keys can even be used in another way than forming > >> nested loops such as using them to implement hash or merge joins > >> (again for regular Join or Correlate join after decorrelation). > >> > >> Thanks, > >> Walaa. > >> > >> On Thu, Mar 21, 2019 at 2:08 PM Julian Hyde <[email protected]> wrote: > >>> > >>>> In addition, I would not present Correlate > >>>> as a nested loops join. > >>> > >>> > >>> Is your concern with how we have structured the class hierarchy? Or > just > >> how we describe Correlate in the documentation? > >>> > >>> I do agree that Correlate and nested loops joins are not the same (one > >> is logical, the other physical). However, they have a lot in common, in > >> particular the fact that one input sets variables and the input reads > those > >> variables. > >>> > >>> I can’t think of any way to represent a nested loops join (e.g. for > each > >> department, find all employees in that department) that does not use > >> variables to tie together the two inputs. And therefore I am happy with > the > >> fact that our Java implementation of nested-loops join is ‘class > >> EnumerableCorrelate extends Correlate’. > >>> > >>> > >>> Julian > >>> > >>>> On Mar 21, 2019, at 1:12 PM, Walaa Eldin Moustafa < > >> [email protected]> wrote: > >>>> > >>>> I would vote for number 3. In addition, I would not present Correlate > >>>> as a nested loops join. Moreover, nested loops, hash and merge joins > >>>> should be able to map to both Join or Correlate logical ones when > >>>> possible (no inherent correlation between logical join type and > >>>> physical types). > >>>> > >>>> On Thu, Mar 21, 2019 at 11:55 AM Julian Hyde <[email protected]> > wrote: > >>>>> > >>>>> I have a few ideas for refactorings. (I’m not convinced by any of > >> them, but let me know which you like.) > >>>>> > >>>>> 1. Get rid of SemiJoinType. It is mis-named (it is not used by > >> SemiJoin, it is used by Correlate, but in a field called joinType). > >>>>> > >>>>> 2. In Correlate, use org.apache.calcite.linq4j.CorrelateJoinType. It > >> has the same set of values as SemiJoinType, but it has a better name. > >>>>> > >>>>> 3. Get rid of both SemiJoinType and CorrelateJoinType, and use > >> JoinRelType for everything. We would have to add SEMI and ANTI values. > Also > >> some methods to find out whether the resulting row type contains fields > >> from the left and right inputs or just the left input. > >>>>> > >>>>> 4. Add “interface JoinLike extends BiRel” and make Join, SemiJoin and > >> Correlate implement it. It would have a methods that say whether the LHS > >> and RHS generate nulls, and whether the output row type contains columns > >> from the right input. This seems attractive because it lets Join, > SemiJoin > >> and Correlate continue to be structurally different. > >>>>> > >>>>> Julian > >>>>> > >>>>> > >>>>> > >>>>> > >>>>>> On Mar 20, 2019, at 6:55 PM, Haisheng Yuan <[email protected]> > >> wrote: > >>>>>> > >>>>>> SubPlan (in Postgres’ term) is a Postgres physical relational node > >> to evaluate correlated subquery. What I mean is correlated subquery that > >> can’t be decorrelated can’t be implemented by hashjoin or mergejoin. > But it > >> is off topic. > >>>>>> > >>>>>> Thanks ~ > >>>>>> Haisheng Yuan > >>>>>> ------------------------------------------------------------------ > >>>>>> 发件人:Walaa Eldin Moustafa<[email protected]> > >>>>>> 日 期:2019年03月21日 09:31:41 > >>>>>> 收件人:<[email protected]> > >>>>>> 抄 送:Stamatis Zampetakis<[email protected]> > >>>>>> 主 题:Re: Re: Join, SemiJoin, Correlate > >>>>>> > >>>>>> Agreed with Stamatis. Currently: 1) Correlate is tied to IN, EXISTS, > >>>>>> NOT IN, NOT EXISTS, and 2) is used as an equivalent to nested loops > >>>>>> join. The issues here are: 1) IN, EXISTS, NOT IN, NOT EXISTS can be > >>>>>> rewritten as semi/anti joins, and 2) nested loops join is more of a > >>>>>> physical operator. > >>>>>> > >>>>>> It seems that the minimal set of logical join types are INNER, LEFT, > >>>>>> RIGHT, OUTER, SEMI, ANTI. > >>>>>> > >>>>>> So I think Calciate could have one LogicalJoin operator with an > >>>>>> attribute to specify the join type (from the above), and a number of > >>>>>> physical join operators (hash, merge, nested loops) whose > >>>>>> implementation details depend on the the join type. > >>>>>> > >>>>>> What we lose by this model is the structure of the query (whether > >>>>>> there was a sub-plan or not), but I would say that this is actually > >>>>>> what is desired from a logical representation -- to abstract away > >> from > >>>>>> how the query is written, and how it is structured, as long as there > >>>>>> is a canonical representation. There could also be a world where > both > >>>>>> models coexist (Correlate first then Decorrelate but in the light of > >> a > >>>>>> single logical join operator?). > >>>>>> > >>>>>> @Haisheng, generally, a sub-plan can also be implemented using a > >>>>>> variant of hash or merge joins as long as we evaluate the sub-plan > >>>>>> independently (without the join predicate), but that is up to the > >>>>>> optimizer. > >>>>>> > >>>>>> Thanks, > >>>>>> Walaa. > >>>>>> > >>>>>> On Wed, Mar 20, 2019 at 5:23 PM Haisheng Yuan < > >> [email protected]> wrote: > >>>>>>> > >>>>>>> SemiJoinType and its relationship with JoinRelType do confuse me a > >> little bit. > >>>>>>> > >>>>>>> But I don’t think we should not have LogicalCorrelate. It is useful > >> to represent the lateral or correlated subquery (aka SubPlan in Postgres > >> jargon). The LogicalCorrelate can be implemented as NestLoopJoin in > >> Calcite, or SubPlan in Postgres’s terminology, but it can’t be > implemented > >> as HashJoin or MergeJoin. > >>>>>>> > >>>>>>> Thanks ~ > >>>>>>> Haisheng Yuan > >>>>>>> ------------------------------------------------------------------ > >>>>>>> 发件人:Stamatis Zampetakis<[email protected]> > >>>>>>> 日 期:2019年03月21日 07:13:15 > >>>>>>> 收件人:<[email protected]> > >>>>>>> 主 题:Re: Join, SemiJoin, Correlate > >>>>>>> > >>>>>>> I have bumped into this quite a few times and I think we should > >> really try > >>>>>>> to improve the design of the join hierarchy. > >>>>>>> > >>>>>>> From a logical point of view I think it makes sense to have the > >> following > >>>>>>> operators: > >>>>>>> InnerJoin, LeftOuterJoin, FullOuterJoin, SemiJoin, AntiJoin, > >> (GroupJoin) > >>>>>>> > >>>>>>> Yet I have not thought thoroughly what should become a class, and > >> what a > >>>>>>> property of the class (e.g., JoinRelType, SemiJoinType). > >>>>>>> > >>>>>>> Moreover, Correlate as it is right now, is basically a nested loop > >> join (as > >>>>>>> its Javadoc also indicates). > >>>>>>> Nested loop join is most often encountered as a physical operator > >> so I am > >>>>>>> not sure if it should remain as is (in particular the > >> LogicalCorrelate). > >>>>>>> As we do not have HashJoin, MergeJoin, etc., operators at the > >> logical > >>>>>>> level, I think we should not have a NestedLoopJoin (aka., > >> LogicalCorrelate). > >>>>>>> There are valid reasons why Correlate was introduced in the first > >> place but > >>>>>>> I think we should rethink a bit the design and the needs. > >>>>>>> > >>>>>>> @Julian: I do not know to what extend you would like to rethink the > >>>>>>> hierarchy but I have the impression that even small changes can > >> easily > >>>>>>> break backward compatibility. > >>>>>>> > >>>>>>> > >>>>>>> Στις Τετ, 20 Μαρ 2019 στις 8:07 μ.μ., ο/η Julian Hyde < > >> [email protected]> > >>>>>>> έγραψε: > >>>>>>> > >>>>>>>> I just discovered that Correlate, which is neither a Join nor a > >> SemiJoin, > >>>>>>>> uses SemiJoinType, but SemiJoin does not use SemiJoinType. > >>>>>>>> > >>>>>>>> Yuck. The Join/SemiJoin/Correlate type hierarchy needs some > >> thought. > >>>>>>>> > >>>>>>>> Julian > >>>>>>>> > >>>>>>>> > >>>>>>>> > >>>>>>> > >>>>> > >>> > >> > > > >
