Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/20631#discussion_r169429748
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -1051,6 +1053,16 @@ from the aggregation column.
For example, `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append output mode.
+##### Semantic Guarantees of Aggregation with Watermarking
+{:.no_toc}
+
+- A watermark delay (set with `withWatermark`) of "2 hours" guarantees that the engine will never
+drop any data that is less than 2 hours delayed. In other words, any data less than 2 hours behind
+(in terms of event-time) the latest data processed till then is guaranteed to be aggregated.
+
+- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
+not guaranteed to be dropped; it may or may not get aggregated. The more delayed the data is, the
+less likely the engine is to process it.
--- End diff ---
good catch. let me fix it.
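
For context, a minimal sketch (not part of the PR) of the pattern the quoted doc text describes: a windowed count with a 2-hour watermark, written in Append output mode. The `rate` source, `console` sink, and the 10-minute window are placeholder choices for illustration only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("WatermarkSketch").getOrCreate()
import spark.implicits._

// Test source emitting (timestamp, value) rows; any event-time source works.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()

// The watermark is declared on the same event-time column used in the
// window, and before the aggregation, so Append mode is allowed.
val counts = events
  .withWatermark("timestamp", "2 hours")
  .groupBy(window($"timestamp", "10 minutes"))
  .count()

// In Append mode a window's count is emitted only once the watermark
// (max event time seen minus 2 hours) has passed the end of that window;
// data arriving less than 2 hours late is still counted, data later than
// that may or may not be.
val query = counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
```
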