Hi Danny, Yes, I agree that we don't need two variations of hash joins (EnumerableHashJoin and EnumerableThetaHashJoin) as it is the code right now.
Having that said, I am not sure if we can really handle the general case of theta joins with hash-based algorithms. The approach in [3] does not really handle general theta joins but allows to split a theta condition into equi and non-equi parts where the non-equi part is executed as a post-join filter. Since the post-join filter is part of the join operator we can say that we have a theta hash join operator but this is partially true. For instance, the following query: SELECT e1.name FROM emp e1 INNER JOIN emp e2 ON e1.salary < e2.salary cannot use at all hash join algorithm [3] because there is no equality condition. I wouldn't mind keeping EnumerableHashJoin only for equijoins and execute the post-join filter as such but this would make outer joins difficult/impossible to handle so I believe [3] is a good step forward. Best, Stamatis On Wed, Apr 24, 2019 at 6:28 AM Yuzhao Chen <[email protected]> wrote: > Julian, > > I also agree with you that we should distinguish between > EnumerableHashJoin and EnumerableMergeJoin, what i want to address about is > if we should extend the EnumerableHashJoin and EnumerableMergeJoin to let > them support Thera join conditions except the EquiJoin conditions. > > It seems that Stamatis also agree that we can extend the existing > EnumerableXXXs, correct me if i have wrong understanding. > > > Best, > Danny Chan > 在 2019年4月24日 +0800 AM2:18,Julian Hyde <[email protected]>,写道: > > I agree with Stamatis. > > > > Let’s suppose we have a merge join and hash join in enumerable > convention. The merge join requires its input to be sorted, the hash join > does not. They can both perform equi-join. The merge join can also perform > joins like > > > > FROM orders > > JOIN shipments > > ON shipments.shipDate BETWEEN orders.orderDate > > AND orders.orderDate + INTERVAL ‘7’ DAY > > > > Should we use the same sub-class of RelNode, EnumerableJoin, for both, > or should we create sub-classes EnumerableHashJoin and EnumerableMergeJoin? > > > > I think we should do the latter: > > * They are clearly different algorithms, and we generate different code > for them > > * They have different cost models > > * They have different input and output traits (the merge join requires > sorted inputs and produces a sorted output) > > * If we have an equi-join, we might want to generate both options and > see how they compare (the merge join might win if one or both of the inputs > are sorted), and both algorithms used the EnumerableJoin class we would not > be able to distinguish the RelNode instances. > > > > Julian > > > > > > > On Apr 23, 2019, at 11:06 AM, Stamatis Zampetakis <[email protected]> > wrote: > > > > > > In the literature, there are many algorithms to perform a join and the > vast > > > majority of them cannot handle theta conditions (either for performance > > > reasons, or because the algorithm was simply not meant for it). > > > > > > For the existing EnumerableXXXJoin variants, it may be possible to > extend > > > the algorithm to handle theta joins without structural changes to the > > > algorithm (as I think it is the case in [3]). > > > For these cases, I agree with you it does not make sense to have many > > > variants (e.g., EnumerableThetaHashJoin seems redundant). > > > > > > In the future we may decide to use other hash-based implementations > with > > > certain applicability restrictions (such as apply only on equijoins) > but > > > let's not worry too much about that now. > > > > > > Regarding the rules, I think it is useful to have different variants. > > > Consider for example a system that does not have join processing > algorithms > > > that can treat theta-joins; pushing non-equi conditions in a join seems > > > undesirable in this case. > > > > > > For more information regarding join processing, there are a few > interesting > > > (and old) surveys [4, 5] which outline some of the most typical > algorithms > > > that are used in relational databases. > > > > > > Best, > > > Stamatis > > > > > > [4] Query evaluation techniques for Large databases ( > > > https://web.stanford.edu/class/cs346/2014/graefe.pdf) > > > [5] Join processing in relational databases ( > > > https://www.csd.uoc.gr/~hy460/pdf/p63-mishra.pdf) > > > > > > > > > On Tue, Apr 23, 2019 at 8:05 AM Yuzhao Chen <[email protected]> > wrote: > > > > > > > Thx, Julian > > > > > > > > Why not just support non-equi join condition for every physical > algorithm, > > > > it does not make much sense if we have both HashJoin and a > HashTheraJoin, > > > > cause a HashThataJoin with empty non-equi join condition is same as a > > > > HashJoin. > > > > > > > > And we can remove the limitations in the rule like FilterJoinRule. > > > > > > > > Best, > > > > > > > > Danny Chan > > > > 在 2019年4月23日 +0800 AM3:21,[email protected],写道: > > > > > > > > > > If there are limitations, over time we would like to remove those > > > > limitations, but we will probably do it by adding new algorithms, and > > > > therefore new EnumerableXxx classes. > > > > > > >
