neilramaswamy commented on code in PR #44323:
URL: https://github.com/apache/spark/pull/44323#discussion_r1582300122
##########
sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinHelper.scala:
##########
@@ -219,10 +222,41 @@ object StreamingSymmetricHashJoinHelper extends Logging {
attributesWithEventWatermark = AttributeSet(otherSideInputAttributes),
condition,
eventTimeWatermarkForEviction)
- val inputAttributeWithWatermark = oneSideInputAttributes.find(_.metadata.contains(delayKey))
- val expr = watermarkExpression(inputAttributeWithWatermark, stateValueWatermark)
- expr.map(JoinStateValueWatermarkPredicate.apply _)
+ // If the condition itself is empty (for example, left_time < left_time + INTERVAL ...),
+ // then we will not have generated a stateValueWatermark.
+ if (stateValueWatermark.isEmpty) {
+ None
+ } else {
+ // For example, if the condition is of the form:
+ // left_time > right_time + INTERVAL 30 MINUTES
+ // Then this extracts left_time and right_time.
+ val attributesInCondition = AttributeSet(
+ condition.get.collect { case a: AttributeReference => a }
+ )
+
+ // Construct an AttributeSet so that we can perform equality between attributes,
+ // which we do in the filter below.
+ val oneSideInputAttributeSet = AttributeSet(oneSideInputAttributes)
+
+ // oneSideInputAttributes could be [left_value, left_time], and we just
+ // want the attribute _in_ the time-interval condition.
+ val oneSideStateWatermarkAttributes = attributesInCondition.filter { a =>
+ oneSideInputAttributeSet.contains(a)
+ }
+
+ // There should be a single attribute per side in the time-interval condition, so,
+ // filtering for oneSideInputAttributes as done above should lead us with 1 attribute.
+ if (oneSideStateWatermarkAttributes.size == 1) {
+ val expr =
+ watermarkExpression(Some(oneSideStateWatermarkAttributes.head), stateValueWatermark)
+ expr.map(JoinStateValueWatermarkPredicate.apply _)
+ } else {
+ // This should never happen, since the grammar will ensure that we have one attribute
Review Comment:
Good question. After thinking about this more, I think I may be wrong about an
edge case that none of our tests cover. If the user writes
`left_time > right_time + m AND other_left_time > right_time + n`, there
will be _three_ attributes in the condition. Then
`oneSideStateWatermarkAttributes.size` will be 2 (`left_time` and
`other_left_time`, neither of which is a watermark attribute), and the
condition we need to return would be a conjunctive watermark predicate:
`left_time <= watermark(right) + m AND other_left_time <= watermark(right) + n`.
We can still remove state in that case, but I'm pretty sure the current
implementation in Spark master would fail. I need to check this. It might be
out of scope for this PR.
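
To make the edge case concrete, here is a minimal, Spark-free sketch. `Attr`, `TimeBound`, and `ConjunctiveEvictionSketch` are hypothetical stand-ins invented for illustration, not Catalyst types, and the predicate is only the conjunctive form written above, not what Spark master currently does.

```scala
// Hypothetical stand-ins for Catalyst attributes; not Spark's actual types.
final case class Attr(name: String)

// One conjunct of the join condition, of the form `left > right + offsetMs`.
final case class TimeBound(left: Attr, right: Attr, offsetMs: Long)

object ConjunctiveEvictionSketch {
  // Attributes referenced by the condition that belong to one side's input,
  // mirroring the filter over attributesInCondition in the diff.
  def oneSideAttributes(condition: Seq[TimeBound], oneSideInput: Set[Attr]): Set[Attr] =
    condition.flatMap(b => Seq(b.left, b.right)).filter(oneSideInput.contains).toSet

  // The conjunctive predicate from the comment:
  //   left_time <= watermark(right) + m AND other_left_time <= watermark(right) + n
  // `row` maps each left-side attribute to its event time in milliseconds.
  def canEvict(row: Map[Attr, Long], condition: Seq[TimeBound], rightWatermarkMs: Long): Boolean =
    condition.forall(b => row(b.left) <= rightWatermarkMs + b.offsetMs)
}
```

With conjuncts on `left_time` and `other_left_time`, `oneSideAttributes` returns two attributes for the left side, which is the `size == 2` branch that the `else` above assumes can never happen.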
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]