Hey all, We are planning implement watermarking in Structured Streaming that would allow us handle late, out-of-order data better. Specially, when we are aggregating over windows on event-time, we currently can end up keeping unbounded amount data as state. We want to define watermarks on the event time in order mark and drop data that are "too late" and accordingly age out old aggregates that will not be updated any more.
To enable the user to specify details like lateness threshold, we are considering adding a new method to Dataset. We would like to get more feedback on this API. Here is the design doc https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6Z LIS03xhkfCQ/ Please comment on the design and proposed APIs. Thank you very much! TD