Github user tdas commented on a diff in the pull request:
https://github.com/apache/spark/pull/16294#discussion_r93547362
--- Diff: docs/structured-streaming-programming-guide.md ---
@@ -671,12 +678,114 @@ windowedCounts = words.groupBy(
</div>
+### Handling Late Data and Watermarking
Now consider what happens if one of the events arrives late to the
application.
For example, a word that was generated at 12:04 but it was received at
12:11.
-Since this windowing is based on the time in the data, the time 12:04
should be considered for windowing. This occurs naturally in our window-based
grouping â the late data is automatically placed in the proper windows and
the correct aggregates are updated as illustrated below.
+Since this windowing is based on the time in the data, the time 12:04
should be considered for
+windowing. This occurs naturally in our window-based grouping â the late
data is
+automatically placed in the proper windows and the correct aggregates are
updated as illustrated below.

+Furthermore, since Spark 2.1, you can define a watermark on the event
time, and specify the threshold
+on how late the date can be in terms of the event time. The engine will
automatically track the
+event time and drop any state that is related to old windows that are not
expected to receive older
+than (max event time seen - late threshold). This allows the engine to
bound the size of the state
+that is needed for calculating windowed aggregates. For example, we can
apply watermarking to the
+previous example as follows.
+
+<div class="codetabs">
+<div data-lang="scala" markdown="1">
+
+{% highlight scala %}
+import spark.implicits._
--- End diff --
I wasnt planning to adding examples in this PR to keep this just about
docs. I am happy to see simple examples contributed as PRs from the community.
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]