[
https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Armbrust reassigned SPARK-18124:
----------------------------------------
Assignee: Michael Armbrust (was: Tathagata Das)
> Implement watermarking for handling late data
> ---------------------------------------------
>
> Key: SPARK-18124
> URL: https://issues.apache.org/jira/browse/SPARK-18124
> Project: Spark
> Issue Type: Sub-task
> Components: SQL, Streaming
> Reporter: Tathagata Das
> Assignee: Michael Armbrust
>
> Whenever we aggregate data by event time, we want to consider data is late
> and out-of-order in terms of its event time. Since we keep aggregate keyed by
> the time as state, the state will grow unbounded if we keep around all old
> aggregates in an attempt consider arbitrarily late data. Since the state is a
> store in-memory, we have to prevent building up of this unbounded state.
> Hence, we need a watermarking mechanism by which we will mark data that is
> older beyond a threshold as “too late”, and stop updating the aggregates with
> them. This would allow us to remove old aggregates that are never going to be
> updated, thus bounding the size of the state.
> Here is the design doc -
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]