[jira] [Commented] (SPARK-2629) Improve performance of DStream.updateStateByKey

Vinoth Chandar (JIRA) Wed, 25 Mar 2015 18:42:07 -0700

    [ https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381199#comment-14381199 ]


Vinoth Chandar commented on SPARK-2629:
---------------------------------------

Hi [~tdas],
I stumbled across this while digging deeper into updateStateByKey's scaling 
limits. My understanding is that the state (at least the 'working set') has to 
fit in memory, since an RDD is not an indexed structure and Spark Streaming 
regenerates the entire state RDD on every batch when applying 
updateStateByKey, iterating through all elements. Can you elaborate on the 
IndexRDD approach or the alternatives being considered? 
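
To illustrate the concern above, here is a minimal sketch in plain Python (it 
is not Spark code and does not use the Spark API). It models one micro-batch 
of updateStateByKey under the assumption that existing state and new records 
are grouped by key and the update function is then invoked for every key that 
has state, whether or not new data arrived for it. The names 
update_state_by_key and running_count are hypothetical and exist only for this 
illustration.

```python
from typing import Callable, Dict, Hashable, Iterable, List, Optional, Tuple


def update_state_by_key(
    state: Dict[Hashable, int],
    batch: Iterable[Tuple[Hashable, object]],
    update_fn: Callable[[List[object], Optional[int]], Optional[int]],
) -> Tuple[Dict[Hashable, int], int]:
    """Model one micro-batch. Returns (new_state, number_of_keys_visited)."""
    # Group the new records of this batch by key.
    grouped: Dict[Hashable, List[object]] = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    new_state: Dict[Hashable, int] = {}
    visited = 0
    # Every key with existing state is visited, even if the batch has
    # no records for it. This is the full scan the comment refers to.
    for key in state.keys() | grouped.keys():
        visited += 1
        result = update_fn(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from the state
            new_state[key] = result
    return new_state, visited


def running_count(new_values: List[object], old: Optional[int]) -> int:
    """Example update function: keep a running count of records per key."""
    return (old or 0) + len(new_values)


if __name__ == "__main__":
    state = {f"k{i}": 1 for i in range(1000)}
    # Only one key receives new data, yet all 1000 keys are visited.
    state, visited = update_state_by_key(state, [("k0", "x")], running_count)
    print(visited, state["k0"], state["k1"])
```

In this model the work per batch grows with the total number of keys in the 
state rather than with the number of keys touched by the batch, which is why 
an indexed structure that allows updating only the touched keys is of 
interest.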

Context: I am working on an ingestion layer for database change logs onto HDFS, 
so the state could be an entire database (6-10 TB). So I am very interested in 
this.

> Improve performance of DStream.updateStateByKey
> -----------------------------------------------
>
>                 Key: SPARK-2629
>                 URL: https://issues.apache.org/jira/browse/SPARK-2629
>             Project: Spark
>          Issue Type: Improvement
>          Components: Streaming
>            Reporter: Tathagata Das
>            Assignee: Tathagata Das
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

