GitHub user tdas opened a pull request:
https://github.com/apache/spark/pull/9256
[SPARK-2629][STREAMING] Basic implementation of trackStateByKey
Current updateStateByKey provides stateful processing in Spark Streaming.
It allows the user to maintain per-key state and manage that state using an
updateFunction. The updateFunction is called for each key, and it uses new data
and existing state of the key, to generate an updated state. However, based on
community feedback, we have learnt the following lessons.
* Need for more optimized state management that does not scan every key
* Need to make it easier to implement common use cases - (a) timeout of
idle data, (b) returning items other than state
The high level idea that of this PR
* Introduce a new API trackStateByKey that, allows the user to update
per-key state, and emit arbitrary records. The new API is necessary as this
will have significantly different semantics than the existing updateStateByKey
API. This API will have direct support for timeouts.
* Internally, the system will keep the state data as a map/list within the
partitions of the state RDDs. The new data RDDs will be partitioned
appropriately, and for all the key-value data, it will lookup the map/list in
the state RDD partition and create a new list/map of updated state data. The
new state RDD partition will be created based on the update data and if
necessary, with old data.
Here is the detailed design doc. Please take a look and provide feedback as
comments.
https://docs.google.com/document/d/1NoALLyd83zGs1hNGMm0Pc5YOVgiPpMHugGMk6COqxxE/edit#heading=h.ph3w0clkd4em
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tdas/spark trackStateByKey
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/9256.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #9256
----
commit 7f15e29a4c2500cadbd9fd38a7aca5e58e2b9487
Author: Tathagata Das <[email protected]>
Date: 2015-10-08T08:22:22Z
First draft of sessionByKey
commit ff7731207dc415f951b9451d9b9d3aa2c7ef99cf
Author: Tathagata Das <[email protected]>
Date: 2015-10-09T01:46:11Z
Renamed SessionMap to SessionStore and fixed checkpointing bug in SessionRDD
commit 1fea358dda31ebc5bf089e95f12a9c56165e4555
Author: Tathagata Das <[email protected]>
Date: 2015-10-10T00:25:28Z
Added OpenHashMapBasedSessionStore
commit 27dbabc8e3d44467fdc10f2333ef42d59c6f85d8
Author: Tathagata Das <[email protected]>
Date: 2015-10-13T05:15:56Z
Fixed bugs
commit 3fc50675b3471c14abded7cbfd23693c81c2b58d
Author: Tathagata Das <[email protected]>
Date: 2015-10-13T05:42:52Z
Made delta chain threshold configurable
commit 514eb017406d531bba41070323cd51efe213ccad
Author: Tathagata Das <[email protected]>
Date: 2015-10-13T06:18:05Z
Fixed NPE
commit d5b2bec581cccb97b27878b39914b1e5327c3f86
Author: Tathagata Das <[email protected]>
Date: 2015-10-13T13:22:38Z
consolidation while checkpointing
commit 58eee1ee9f4960b13df084202ae2ed57e671a744
Author: Tathagata Das <[email protected]>
Date: 2015-10-13T13:40:26Z
Fixed bug
commit 672e3e620774aec13e8b81549a0d7bdbb98f8455
Author: Tathagata Das <[email protected]>
Date: 2015-10-13T14:21:27Z
optimized serialization
commit 51465f4bea5da3afe6067f05ce2a1bc965b277b5
Author: Tathagata Das <[email protected]>
Date: 2015-10-22T01:52:06Z
Updated API based on updated design
commit 10f6a0ecbb56e8aaad50681398bfbda7e1134f92
Author: Tathagata Das <[email protected]>
Date: 2015-10-23T02:41:42Z
Refactoring the code
commit b7c653d164536fc1954597451915221b39aae9f3
Author: Tathagata Das <[email protected]>
Date: 2015-10-23T22:23:51Z
Added licenses, and fixed stuff
commit bd9cd9449a15914b410a4482d610ce9355f28b54
Author: Tathagata Das <[email protected]>
Date: 2015-10-23T22:24:41Z
Merge remote-tracking branch 'apache-github/master' into trackStateByKey
----
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]