[
https://issues.apache.org/jira/browse/SPARK-18124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15623603#comment-15623603
]
Apache Spark commented on SPARK-18124:
--------------------------------------
User 'marmbrus' has created a pull request for this issue:
https://github.com/apache/spark/pull/15702
> Implement watermarking for handling late data
> ---------------------------------------------
>
> Key: SPARK-18124
> URL: https://issues.apache.org/jira/browse/SPARK-18124
> Project: Spark
> Issue Type: Sub-task
> Components: SQL, Streaming
> Reporter: Tathagata Das
> Assignee: Michael Armbrust
>
> Whenever we aggregate data by event time, we want to consider data is late
> and out-of-order in terms of its event time. Since we keep aggregate keyed by
> the time as state, the state will grow unbounded if we keep around all old
> aggregates in an attempt consider arbitrarily late data. Since the state is a
> store in-memory, we have to prevent building up of this unbounded state.
> Hence, we need a watermarking mechanism by which we will mark data that is
> older beyond a threshold as “too late”, and stop updating the aggregates with
> them. This would allow us to remove old aggregates that are never going to be
> updated, thus bounding the size of the state.
> Here is the design doc -
> https://docs.google.com/document/d/1z-Pazs5v4rA31azvmYhu4I5xwqaNQl6ZLIS03xhkfCQ/edit?usp=sharing
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]