silundong commented on code in PR #4725:
URL: https://github.com/apache/calcite/pull/4725#discussion_r2666893928
##########
core/src/main/java/org/apache/calcite/rel/core/Join.java:
##########
@@ -190,6 +191,13 @@ public JoinRelType getJoinType() {
+ " failures in condition " + condition);
}
}
+ if (joinType == JoinRelType.LEFT_MARK
+ && joinInfo.nullExclusionFlags.contains(true)
+ && !joinInfo.nonEquiConditions.isEmpty()) {
+ return litmus.fail("Left mark join is produced by rewriting
IN/SOME/EXISTS "
Review Comment:
I think mark join should only be created during removing subqueries and
decorrelation phases, and at that point it strictly satisfies this constraint.
I can't think of any other optimization that would violate this constraint;
even if some optimization were to change the mark join condition (although I
don't think this should happen), the constraint should still be satisfied —
otherwise the transformation would likely be non‑equivalent.
##########
core/src/main/java/org/apache/calcite/rel/core/Join.java:
##########
@@ -260,9 +268,20 @@ protected int deepHashCode0() {
}
@Override protected RelDataType deriveRowType() {
+ RelDataTypeFactory typeFactory = getCluster().getTypeFactory();
+ RelDataType rightType = right.getRowType();
+ if (joinType == JoinRelType.LEFT_MARK) {
Review Comment:
The output of a left mark join is `[left side columns, marker]`; it indeed
has no direct relation to the right-side output. My intention was to reuse the
`SqlValidatorUtil.deriveJoinRowType` method.
Conceptually the marker’s type is a three-valued boolean, but it can be
simplified to a two-valued boolean depending on the specific case. For example,
for a left mark join produced by rewriting an EXISTS subquery, the marker is a
two-valued boolean. If I simply set the marker type to nullable BOOLEAN,
rewriting EXISTS subqueries leads to a type-mismatch error. Inside
`deriveRowType` I no longer know which subquery the mark join was rewritten
from, but that information is implicit in the condition — for example, after
rewriting and decorrelating an EXISTS (correlated) subquery, the condition will
only contain IS NOT DISTINCT FROM. Therefore I derive the marker type from the
condition.
Do I need to define a new method instead of resuing
`SqlValidatorUtil.deriveJoinRowType`, or just add some comments here?
##########
linq4j/src/main/java/org/apache/calcite/linq4j/EnumerableDefaults.java:
##########
@@ -1685,6 +1686,279 @@ private static <TSource, TInner, TKey, TResult>
Enumerable<TResult> hashJoinWith
};
}
+ /**
+ * Left mark join implementation based on hash. It will keep all rows from
the left side and
+ * creates a new attribute to mark the rows from left input as having join
partners from right
+ * side or not.
+ *
+ * <p> Left mark join is produced by rewriting IN/SOME/EXISTS subqueries, so
left mark join will
+ * never contain both not null-safe join keys and non-equi predicates.
+ *
+ * @param outer Left input
+ * @param inner Right input
+ * @param outerKeyNullAwareSelector Function that extracts keys from the
row of left input
+ * (return NULL when a not null-safe key
has a NULL value)
+ * @param innerKeyNullAwareSelector Function that extracts keys from the
row of right input
+ * (return NULL when a not null-safe key
has a NULL value)
+ * @param outerNullSafeKeySelector Function that extracts the null-safe
keys from the row of
+ * left input
+ * @param innerNullSafeKeySelector Function that extracts the null-safe
keys from the row of
+ * right input
+ * @param atMostOneNotNullSafeKey There is at most one not null-safe key
in join keys
+ * @param resultSelector Function that concats the row of left
input and marker
+ * @param comparer Function that compares the keys
+ * @param nullSafeComparer Function that compares the null-safe
keys
+ * @param nonEquiPredicate Non-equi predicate that can return NULL
+ * @param equiPredicate Equi predicate that can return NULL
+ */
+ public static <TSource, TInner, TKey, TNsKey, TResult> Enumerable<TResult>
leftMarkHashJoin(
+ final Enumerable<TSource> outer, final Enumerable<TInner> inner,
+ final Function1<TSource, TKey> outerKeyNullAwareSelector,
+ final Function1<TInner, TKey> innerKeyNullAwareSelector,
+ final @Nullable Function1<TSource, TNsKey> outerNullSafeKeySelector,
+ final @Nullable Function1<TInner, TNsKey> innerNullSafeKeySelector,
+ final boolean atMostOneNotNullSafeKey,
+ final Function2<TSource, @Nullable Boolean, TResult> resultSelector,
+ final @Nullable EqualityComparer<TKey> comparer,
+ final @Nullable EqualityComparer<TNsKey> nullSafeComparer,
+ final @Nullable NullablePredicate2<TSource, TInner> nonEquiPredicate,
+ final NullablePredicate2<TSource, TInner> equiPredicate) {
+ if (atMostOneNotNullSafeKey) {
+ return leftMarkHashJoinOptimized(outer, inner, outerKeyNullAwareSelector,
+ innerKeyNullAwareSelector, outerNullSafeKeySelector,
innerNullSafeKeySelector,
+ resultSelector, comparer, nullSafeComparer, nonEquiPredicate);
+ }
+ return leftMarkHashJoinGeneral(outer, inner, outerKeyNullAwareSelector,
+ innerKeyNullAwareSelector, outerNullSafeKeySelector,
innerNullSafeKeySelector,
+ resultSelector, comparer, nullSafeComparer, nonEquiPredicate,
equiPredicate);
+ }
+
+ /**
+ * For other join types (especially INNER join), the hash table can be used
to quickly determine
+ * which right-side rows should be joined with a given left-side row. But
left mark join is more
+ * complicated. The <code>marker</code> indicates whether a left row has a
join partner on the
+ * right side, it is a three-valued boolean, meaning we need to know the
actual result (TRUE,
+ * FALSE, or NULL) of the join condition.
+ *
+ * <p> Join key comes in two categories:
+ * <ul>
+ * <li>null-safe key (IS NOT DISTINCT FROM): it only produces
TRUE/FALSE.</li>
+ * <li>not null-safe key (EQUALS): it produces a three-valued boolean.</li>
+ * </ul>
+ *
+ * <p> If all join keys are null-safe, we can get the comparison result
(TRUE or FALSE) based on
+ * hash table matching.
+ *
+ * <p> If there are multiple not null-safe join keys, the hash table
matching alone is
+ * insufficient to determine the comparison result. However, if there is
only one not null-safe
+ * key, we can record whether this key has any NULL values when constructing
the hash table.
+ * During probing, if no match is found in the hash table and there are any
NULL values on this
+ * unique not null-safe key, we can know that the <code>marker</code> is
NULL.
+ */
+ static <TSource, TInner, TKey, TNsKey, TResult> Enumerable<TResult>
leftMarkHashJoinOptimized(
+ final Enumerable<TSource> outer, final Enumerable<TInner> inner,
+ final Function1<TSource, TKey> outerKeyNullAwareSelector,
+ final Function1<TInner, TKey> innerKeyNullAwareSelector,
+ final @Nullable Function1<TSource, TNsKey> outerNullSafeKeySelector,
+ final @Nullable Function1<TInner, TNsKey> innerNullSafeKeySelector,
+ final Function2<TSource, @Nullable Boolean, TResult> resultSelector,
+ final @Nullable EqualityComparer<TKey> comparer,
+ final @Nullable EqualityComparer<TNsKey> nullSafeComparer,
+ final @Nullable NullablePredicate2<TSource, TInner> nonEquiPredicate) {
+ return new AbstractEnumerable<TResult>() {
Review Comment:
The implementation in the paper targets the case of a single join key that
is not null‑safe, which is a representative scenario. During the implementation
of this PR, I found that additional handling is needed for multiple join keys
(including both null-safe and not null-safe). The key point is whether the
marker should be FALSE or NULL when the hash table match fails. The test cases
also focus on different combinations of join keys, as well as the case that the
join keys contain NULL.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]