silundong commented on code in PR #4725:
URL: https://github.com/apache/calcite/pull/4725#discussion_r2666893928


##########
core/src/main/java/org/apache/calcite/rel/core/Join.java:
##########
@@ -190,6 +191,13 @@ public JoinRelType getJoinType() {
             + " failures in condition " + condition);
       }
     }
+    if (joinType == JoinRelType.LEFT_MARK
+        && joinInfo.nullExclusionFlags.contains(true)
+        && !joinInfo.nonEquiConditions.isEmpty()) {
+      return litmus.fail("Left mark join is produced by rewriting 
IN/SOME/EXISTS "

Review Comment:
   I think mark join should only be created during removing subqueries and 
decorrelation phases, and at that point it strictly satisfies this constraint. 
I can't think of any other optimization that would violate this constraint; 
even if some optimization were to change the mark join condition (although I 
don't think this should happen), the constraint should still be satisfied — 
otherwise the transformation would likely be non‑equivalent.



##########
core/src/main/java/org/apache/calcite/rel/core/Join.java:
##########
@@ -260,9 +268,20 @@ protected int deepHashCode0() {
   }
 
   @Override protected RelDataType deriveRowType() {
+    RelDataTypeFactory typeFactory = getCluster().getTypeFactory();
+    RelDataType rightType = right.getRowType();
+    if (joinType == JoinRelType.LEFT_MARK) {

Review Comment:
   The output of a left mark join is `[left side columns, marker]`; it indeed 
has no direct relation to the right-side output. My intention was to reuse the 
`SqlValidatorUtil.deriveJoinRowType` method.
   
   Conceptually the marker’s type is a three-valued boolean, but it can be 
simplified to a two-valued boolean depending on the specific case. For example, 
for a left mark join produced by rewriting an EXISTS subquery, the marker is a 
two-valued boolean. If I simply set the marker type to nullable BOOLEAN, 
rewriting EXISTS subqueries leads to a type-mismatch error. Inside 
`deriveRowType` I no longer know which subquery the mark join was rewritten 
from, but that information is implicit in the condition — for example, after 
rewriting and decorrelating an EXISTS (correlated) subquery, the condition will 
only contain IS NOT DISTINCT FROM. Therefore I derive the marker type from the 
condition.
   
   Do I need to define a new method instead of resuing 
`SqlValidatorUtil.deriveJoinRowType`, or just add some comments here?



##########
linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java:
##########
@@ -1685,6 +1686,279 @@ private static <TSource, TInner, TKey, TResult> 
Enumerable<TResult> hashJoinWith
     };
   }
 
+  /**
+   * Left mark join implementation based on hash. It will keep all rows from 
the left side and
+   * creates a new attribute to mark the rows from left input as having join 
partners from right
+   * side or not.
+   *
+   * <p> Left mark join is produced by rewriting IN/SOME/EXISTS subqueries, so 
left mark join will
+   * never contain both not null-safe join keys and non-equi predicates.
+   *
+   * @param outer                       Left input
+   * @param inner                       Right input
+   * @param outerKeyNullAwareSelector   Function that extracts keys from the 
row of left input
+   *                                    (return NULL when a not null-safe key 
has a NULL value)
+   * @param innerKeyNullAwareSelector   Function that extracts keys from the 
row of right input
+   *                                    (return NULL when a not null-safe key 
has a NULL value)
+   * @param outerNullSafeKeySelector    Function that extracts the null-safe 
keys from the row of
+   *                                    left input
+   * @param innerNullSafeKeySelector    Function that extracts the null-safe 
keys from the row of
+   *                                    right input
+   * @param atMostOneNotNullSafeKey     There is at most one not null-safe key 
in join keys
+   * @param resultSelector              Function that concats the row of left 
input and marker
+   * @param comparer                    Function that compares the keys
+   * @param nullSafeComparer            Function that compares the null-safe 
keys
+   * @param nonEquiPredicate            Non-equi predicate that can return NULL
+   * @param equiPredicate               Equi predicate that can return NULL
+   */
+  public static <TSource, TInner, TKey, TNsKey, TResult> Enumerable<TResult> 
leftMarkHashJoin(
+      final Enumerable<TSource> outer, final Enumerable<TInner> inner,
+      final Function1<TSource, TKey> outerKeyNullAwareSelector,
+      final Function1<TInner, TKey> innerKeyNullAwareSelector,
+      final @Nullable Function1<TSource, TNsKey> outerNullSafeKeySelector,
+      final @Nullable Function1<TInner, TNsKey> innerNullSafeKeySelector,
+      final boolean atMostOneNotNullSafeKey,
+      final Function2<TSource, @Nullable Boolean, TResult> resultSelector,
+      final @Nullable EqualityComparer<TKey> comparer,
+      final @Nullable EqualityComparer<TNsKey> nullSafeComparer,
+      final @Nullable NullablePredicate2<TSource, TInner> nonEquiPredicate,
+      final NullablePredicate2<TSource, TInner> equiPredicate) {
+    if (atMostOneNotNullSafeKey) {
+      return leftMarkHashJoinOptimized(outer, inner, outerKeyNullAwareSelector,
+          innerKeyNullAwareSelector, outerNullSafeKeySelector, 
innerNullSafeKeySelector,
+          resultSelector, comparer, nullSafeComparer, nonEquiPredicate);
+    }
+    return leftMarkHashJoinGeneral(outer, inner, outerKeyNullAwareSelector,
+        innerKeyNullAwareSelector, outerNullSafeKeySelector, 
innerNullSafeKeySelector,
+        resultSelector, comparer, nullSafeComparer, nonEquiPredicate, 
equiPredicate);
+  }
+
+  /**
+   * For other join types (especially INNER join), the hash table can be used 
to quickly determine
+   * which right-side rows should be joined with a given left-side row. But 
left mark join is more
+   * complicated. The <code>marker</code> indicates whether a left row has a 
join partner on the
+   * right side, it is a three-valued boolean, meaning we need to know the 
actual result (TRUE,
+   * FALSE, or NULL) of the join condition.
+   *
+   * <p> Join key comes in two categories:
+   * <ul>
+   *   <li>null-safe key (IS NOT DISTINCT FROM): it only produces 
TRUE/FALSE.</li>
+   *   <li>not null-safe key (EQUALS): it produces a three-valued boolean.</li>
+   * </ul>
+   *
+   * <p> If all join keys are null-safe, we can get the comparison result 
(TRUE or FALSE) based on
+   * hash table matching.
+   *
+   * <p> If there are multiple not null-safe join keys, the hash table 
matching alone is
+   * insufficient to determine the comparison result. However, if there is 
only one not null-safe
+   * key, we can record whether this key has any NULL values when constructing 
the hash table.
+   * During probing, if no match is found in the hash table and there are any 
NULL values on this
+   * unique not null-safe key, we can know that the <code>marker</code> is 
NULL.
+   */
+  static <TSource, TInner, TKey, TNsKey, TResult> Enumerable<TResult> 
leftMarkHashJoinOptimized(
+      final Enumerable<TSource> outer, final Enumerable<TInner> inner,
+      final Function1<TSource, TKey> outerKeyNullAwareSelector,
+      final Function1<TInner, TKey> innerKeyNullAwareSelector,
+      final @Nullable Function1<TSource, TNsKey> outerNullSafeKeySelector,
+      final @Nullable Function1<TInner, TNsKey> innerNullSafeKeySelector,
+      final Function2<TSource, @Nullable Boolean, TResult> resultSelector,
+      final @Nullable EqualityComparer<TKey> comparer,
+      final @Nullable EqualityComparer<TNsKey> nullSafeComparer,
+      final @Nullable NullablePredicate2<TSource, TInner> nonEquiPredicate) {
+    return new AbstractEnumerable<TResult>() {

Review Comment:
   The implementation in the paper targets the case of a single join key that 
is not null‑safe, which is a representative scenario. During the implementation 
of this PR, I found that additional handling is needed for multiple join keys 
(including both null-safe and not null-safe). The key point is whether the 
marker should be FALSE or NULL when the hash table match fails. The test cases 
also focus on different combinations of join keys, as well as the case that the 
join keys contain NULL.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to