[
https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tathagata Das updated SPARK-2629:
---------------------------------
Target Version/s: (was: 1.6.0)
> Improved state management for Spark Streaming
> ---------------------------------------------
>
> Key: SPARK-2629
> URL: https://issues.apache.org/jira/browse/SPARK-2629
> Project: Spark
> Issue Type: Epic
> Components: Streaming
> Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1
> Reporter: Tathagata Das
> Assignee: Tathagata Das
>
> Current updateStateByKey provides stateful processing in Spark Streaming. It
> allows the user to maintain per-key state and manage that state using an
> updateFunction. The updateFunction is called for each key, and it uses new
> data and existing state of the key, to generate an updated state. However,
> based on community feedback, we have learnt the following lessons.
> - Need for more optimized state management that does not scan every key
> - Need to make it easier to implement common use cases - (a) timeout of idle
> data, (b) returning items other than state
> The high level idea that I am proposing is
> - Introduce a new API trackStateByKey that, allows the user to update per-key
> state, and emit arbitrary records. The new API is necessary as this will have
> significantly different semantics than the existing updateStateByKey API.
> This API will have direct support for timeouts.
> - Internally, the system will keep the state data as a map/list within the
> partitions of the state RDDs. The new data RDDs will be partitioned
> appropriately, and for all the key-value data, it will lookup the map/list in
> the state RDD partition and create a new list/map of updated state data. The
> new state RDD partition will be created based on the update data and if
> necessary, with old data.
> Here is the detailed design doc. Please take a look and provide feedback as
> comments.
> https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
> A WIP PR is here - https://github.com/apache/spark/pull/9256
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]