Github user bersprockets commented on a diff in the pull request:
https://github.com/apache/spark/pull/20631#discussion_r169146175
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -1051,6 +1053,16 @@ from the aggregation column.
For example, `df.groupBy("time").count().withWatermark("time", "1 min")`
is invalid in Append
output mode.
+##### Semantic Guarantees of Aggregation with Watermarking
+{:.no_toc}
+
+- A watermark delay (set with `withWatermark`) of "2 hours" guarantees that the engine will never
+drop any data that is less than 2 hours delayed. In other words, any data less than 2 hours behind
+(in terms of event-time) the latest data processed till then is guaranteed to be aggregated.
+
+- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
+not guaranteed to be dropped; it may or may not get aggregated. The more delayed the data is,
+the less likely the engine is to process it.
--- End diff ---
> However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is not guaranteed to be dropped

This might contradict an earlier statement, from "Handling Late Data and Watermarking", that says "In other words, late data within the threshold will be aggregated, but data later than the threshold will be dropped".
---