Hey all,

We are planning implement watermarking in Structured Streaming that would
allow us handle late, out-of-order data better. Specially, when we are
aggregating over windows on event-time, we currently can end up keeping
unbounded amount data as state. We want to define watermarks on the event
time in order mark and drop data that are "too late" and accordingly age
out old aggregates that will not be updated any more.

To enable the user to specify details like lateness threshold, we are
considering adding a new method to Dataset. We would like to get more
feedback on this API. Here is the design doc

https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6Z
LIS03xhkfCQ/

Please comment on the design and proposed APIs.

Thank you very much!

TD

Reply via email to