Tathagata Das created SPARK-18124:
-------------------------------------

             Summary: Implement watermarking for handling late data
                 Key: SPARK-18124
                 URL: https://issues.apache.org/jira/browse/SPARK-18124
             Project: Spark
          Issue Type: Sub-task
            Reporter: Tathagata Das


Whenever we aggregate data by event time, we want to consider data is late and 
out-of-order in terms of its event time. Since we keep aggregate keyed by the 
time as state, the state will grow unbounded if we keep around all old 
aggregates in an attempt consider arbitrarily late data. Since the state is a 
store in-memory, we have to prevent building up of this unbounded state. Hence, 
we need a watermarking mechanism by which we will mark data that is older 
beyond a threshold as “too late”, and stop updating the aggregates with them. 
This would allow us to remove old aggregates that are never going to be 
updated, thus bounding the size of the state.

Here is the design doc - 
https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to