HeartSaVioR commented on a change in pull request #31572:
URL: https://github.com/apache/spark/pull/31572#discussion_r577145290
##########
File path: docs/structured-streaming-programming-guide.md
##########
@@ -1415,6 +1415,18 @@ generation of the outer result may get delayed if there no new data being receiv
*In short, if any of the two input streams being joined does not receive data for a while, the outer (both cases, left or right) output may get delayed.*
+##### Semi Joins with Watermarking
+A semi join returns values from the left side of the relation that has a match with the right.
+It is also referred to as a left semi join. Similar to outer joins, watermark + event-time
+constraints must be specified for semi join. This is because for not generating result for an input
Review comment:
I see the sentence is slightly modified from the outer join one, but it
doesn't seem intuitive for the left semi case. A watermark constraint wouldn't
be a requirement for left semi join if we allowed Spark to buffer all input
rows in state, and the result would be the same. (Please correct me if I'm
missing something here.)
The main purpose is to evict input rows on the left side that can no longer
match anything on the right side, so we'd better emphasize eviction, since
that's what the constraint is for. What about
"This is to evict unmatched input rows on the left side: the engine must know
when an input row on the left side is not going to match with anything on the
right side in the future."
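
To make the point concrete, here is a toy model in plain Python. It is only a
sketch and not Spark's state store logic: `run`, `EVENTS`, `MAX_GAP`, and the
time-range condition (a right row matches when its time falls within `MAX_GAP`
after the left row's time) are all made up for illustration, and it assumes no
row older than the watermark arrives after that watermark. Under those
assumptions the emitted rows are identical with and without eviction; only the
amount of buffered state differs.

```python
# Toy left semi join over an ordered event log (hypothetical model, not Spark code).
# A left row (key, lt) matches a right row (key, rt) when lt <= rt <= lt + max_gap.

MAX_GAP = 5

EVENTS = [
    ("left", "a", 1),
    ("right", "a", 3),         # matches left a@1
    ("left", "b", 2),          # never finds a match
    ("watermark", None, 10),   # assumption: no later row has time < 10
    ("left", "c", 11),
    ("right", "c", 12),        # matches left c@11
    ("right", "d", 13),
    ("left", "d", 12),         # matches the already buffered right d@13
    ("right", "b", 20),        # outside the allowed gap for left b@2
]


def run(events, max_gap, evict):
    """Return (emitted left rows, buffered left rows, buffered right rows)."""
    left, right, emitted = [], [], []
    for kind, key, t in events:
        if kind == "left":
            # Emit right away if a buffered right row matches; otherwise buffer.
            if any(k == key and t <= rt <= t + max_gap for k, rt in right):
                emitted.append((key, t))
            else:
                left.append((key, t))
        elif kind == "right":
            # Emit each waiting left row once, then stop tracking it.
            matched = [(k, lt) for k, lt in left
                       if k == key and lt <= t <= lt + max_gap]
            emitted.extend(matched)
            left = [row for row in left if row not in matched]
            right.append((key, t))
        elif kind == "watermark" and evict:
            # Left rows whose match window closed before the watermark can
            # never match again; right rows older than the watermark cannot
            # match any future left row.
            left = [(k, lt) for k, lt in left if lt + max_gap >= t]
            right = [(k, rt) for k, rt in right if rt >= t]
    return emitted, left, right


with_eviction = run(EVENTS, MAX_GAP, evict=True)
without_eviction = run(EVENTS, MAX_GAP, evict=False)

print("same output:", with_eviction[0] == without_eviction[0])
print("left state with eviction:", with_eviction[1])
print("left state without eviction:", without_eviction[1])
```

In this model the unmatched left row `b@2` stays in state forever unless the
watermark lets the engine drop it, which is the behavior the suggested wording
tries to emphasize.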