[jira] [Commented] (SPARK-2629) Improved state management for Spark Streaming

Iain Cundy (JIRA) Wed, 06 Apr 2016 08:12:01 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15228434#comment-15228434
 ]


Iain Cundy commented on SPARK-2629:
-----------------------------------

Hi Jan

I don't have any more information regarding Kryo in 1.6.1 - there were no 
private answers. I notice that 1.6.1 has been released, but I haven't had a 
convenient opportunity to upgrade to it yet.

Cheers
Iain



> Improved state management for Spark Streaming
> ---------------------------------------------
>
>                 Key: SPARK-2629
>                 URL: https://issues.apache.org/jira/browse/SPARK-2629
>             Project: Spark
>          Issue Type: Epic
>          Components: Streaming
>    Affects Versions: 0.9.2, 1.0.2, 1.2.2, 1.3.1, 1.4.1, 1.5.1
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>
>  Current updateStateByKey provides stateful processing in Spark Streaming. It 
> allows the user to maintain per-key state and manage that state using an 
> updateFunction. The updateFunction is called for each key, and it uses new 
> data and existing state of the key, to generate an updated state. However, 
> based on community feedback, we have learnt the following lessons.
> - Need for more optimized state management that does not scan every key
> - Need to make it easier to implement common use cases - (a) timeout of idle 
> data, (b) returning items other than state
> The high level idea that I am proposing is 
> - Introduce a new API -trackStateByKey- *mapWithState* that, allows the user 
> to update per-key state, and emit arbitrary records. The new API is necessary 
> as this will have significantly different semantics than the existing 
> updateStateByKey API. This API will have direct support for timeouts.
> - Internally, the system will keep the state data as a map/list within the 
> partitions of the state RDDs. The new data RDDs will be partitioned 
> appropriately, and for all the key-value data, it will lookup the map/list in 
> the state RDD partition and create a new list/map of updated state data. The 
> new state RDD partition will be created based on the update data and if 
> necessary, with old data. 
> Here is the detailed design doc (*outdated, to be updated*). Please take a 
> look and provide feedback as comments.
> https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-2629) Improved state management for Spark Streaming

Reply via email to