Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20631#discussion_r169429748
  
    --- Diff: docs/structured-streaming-programming-guide.md ---
    @@ -1051,6 +1053,16 @@ from the aggregation column.
     For example, `df.groupBy("time").count().withWatermark("time", "1 min")` is invalid in Append
     output mode.
     
    +##### Semantic Guarantees of Aggregation with Watermarking
    +{:.no_toc}
    +
    +- A watermark delay (set with `withWatermark`) of "2 hours" guarantees that the engine will never
    +drop any data that is less than 2 hours delayed. In other words, any data less than 2 hours behind
    +(in terms of event-time) the latest data processed till then is guaranteed to be aggregated.
    +
    +- However, the guarantee is strict only in one direction. Data delayed by more than 2 hours is
    +not guaranteed to be dropped; it may or may not get aggregated. The more delayed the data is,
    +the less likely the engine is to process it.
    --- End diff ---
    
    good catch. let me fix it.
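
For reference (not part of the PR), here is a minimal Scala sketch of the pattern the new paragraph documents, assuming a local `SparkSession` and the built-in `rate` test source; the app name and the watermark/window durations are illustrative only.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.window

val spark = SparkSession.builder.appName("WatermarkedCounts").getOrCreate()
import spark.implicits._

// The built-in `rate` test source emits columns `timestamp` and `value`.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "10")
  .load()
  .withColumnRenamed("timestamp", "time")

// The watermark is declared on the event-time column *before* the aggregation;
// the reverse order (as in the invalid example quoted above) is rejected in Append mode.
val counts = events
  .withWatermark("time", "2 hours")
  .groupBy(window($"time", "10 minutes"))
  .count()

// In Append mode a window's count is emitted only after the watermark passes its end.
// Rows less than 2 hours behind the max event time seen so far are guaranteed to be
// aggregated; rows later than that may or may not be.
val query = counts.writeStream
  .outputMode("append")
  .format("console")
  .start()
```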

