GitHub user tdas opened a pull request:

    https://github.com/apache/spark/pull/9256

    [SPARK-2629][STREAMING] Basic implementation of trackStateByKey

    Current updateStateByKey provides stateful processing in Spark Streaming. 
It allows the user to maintain per-key state and manage that state using an 
updateFunction. The updateFunction is called for each key, and it uses new data 
and existing state of the key, to generate an updated state. However, based on 
community feedback, we have learnt the following lessons.
    * Need for more optimized state management that does not scan every key
    * Need to make it easier to implement common use cases - (a) timeout of 
idle data, (b) returning items other than state
    
    The high level idea that of this PR
    * Introduce a new API trackStateByKey that, allows the user to update 
per-key state, and emit arbitrary records. The new API is necessary as this 
will have significantly different semantics than the existing updateStateByKey 
API. This API will have direct support for timeouts.
    * Internally, the system will keep the state data as a map/list within the 
partitions of the state RDDs. The new data RDDs will be partitioned 
appropriately, and for all the key-value data, it will lookup the map/list in 
the state RDD partition and create a new list/map of updated state data. The 
new state RDD partition will be created based on the update data and if 
necessary, with old data.
    Here is the detailed design doc. Please take a look and provide feedback as 
comments.
    
https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tdas/spark trackStateByKey

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9256.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9256
    
----
commit 7f15e29a4c2500cadbd9fd38a7aca5e58e2b9487
Author: Tathagata Das <[email protected]>
Date:   2015-10-08T08:22:22Z

    First draft of sessionByKey

commit ff7731207dc415f951b9451d9b9d3aa2c7ef99cf
Author: Tathagata Das <[email protected]>
Date:   2015-10-09T01:46:11Z

    Renamed SessionMap to SessionStore and fixed checkpointing bug in SessionRDD

commit 1fea358dda31ebc5bf089e95f12a9c56165e4555
Author: Tathagata Das <[email protected]>
Date:   2015-10-10T00:25:28Z

    Added OpenHashMapBasedSessionStore

commit 27dbabc8e3d44467fdc10f2333ef42d59c6f85d8
Author: Tathagata Das <[email protected]>
Date:   2015-10-13T05:15:56Z

    Fixed bugs

commit 3fc50675b3471c14abded7cbfd23693c81c2b58d
Author: Tathagata Das <[email protected]>
Date:   2015-10-13T05:42:52Z

    Made delta chain threshold configurable

commit 514eb017406d531bba41070323cd51efe213ccad
Author: Tathagata Das <[email protected]>
Date:   2015-10-13T06:18:05Z

    Fixed NPE

commit d5b2bec581cccb97b27878b39914b1e5327c3f86
Author: Tathagata Das <[email protected]>
Date:   2015-10-13T13:22:38Z

    consolidation while checkpointing

commit 58eee1ee9f4960b13df084202ae2ed57e671a744
Author: Tathagata Das <[email protected]>
Date:   2015-10-13T13:40:26Z

    Fixed bug

commit 672e3e620774aec13e8b81549a0d7bdbb98f8455
Author: Tathagata Das <[email protected]>
Date:   2015-10-13T14:21:27Z

    optimized serialization

commit 51465f4bea5da3afe6067f05ce2a1bc965b277b5
Author: Tathagata Das <[email protected]>
Date:   2015-10-22T01:52:06Z

    Updated API based on updated design

commit 10f6a0ecbb56e8aaad50681398bfbda7e1134f92
Author: Tathagata Das <[email protected]>
Date:   2015-10-23T02:41:42Z

    Refactoring the code

commit b7c653d164536fc1954597451915221b39aae9f3
Author: Tathagata Das <[email protected]>
Date:   2015-10-23T22:23:51Z

    Added licenses, and fixed stuff

commit bd9cd9449a15914b410a4482d610ce9355f28b54
Author: Tathagata Das <[email protected]>
Date:   2015-10-23T22:24:41Z

    Merge remote-tracking branch 'apache-github/master' into trackStateByKey

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to