[
https://issues.apache.org/jira/browse/SPARK-2629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14381199#comment-14381199
]
Vinoth Chandar commented on SPARK-2629:
---------------------------------------
Hi [~tdas],
Stumbled across this while digging deeper into updateStateByKey's scaling
limits. My understanding is that the state (at least the 'working set') has to
fit in memory, since an RDD is not an indexed structure and Spark Streaming
regenerates the entire state RDD on every batch after applying
updateStateByKey, iterating through all elements.
Can you elaborate on the IndexedRDD approach or the alternatives being
considered?
Context: I am working on an ingestion layer for database change logs onto HDFS,
so the state could be an entire database (6-10 TB). So I am very interested in
this.
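To make the scaling concern concrete, here is a toy model in plain Python of
the per-batch behaviour as I understand it. This is not Spark code and not how
Spark implements it internally; the names update_state_full_scan and
running_sum are made up for illustration, and the in-memory dict is only a
stand-in for the state RDD. The point is that every key is visited on every
batch, even when only a handful of keys received new data.

```python
# Toy model of the per-batch cost described above; plain Python, not Spark code.
# update_state_full_scan and running_sum are hypothetical names.

def update_state_full_scan(state, batch, update_func):
    """Rebuild the whole state map for one batch, visiting every key.

    state:       dict mapping key -> previous state
    batch:       list of (key, value) pairs that arrived in this batch
    update_func: (new_values, old_state_or_None) -> new_state_or_None
    Returns (new_state, number_of_keys_visited).
    """
    # Group the incoming values by key.
    grouped = {}
    for key, value in batch:
        grouped.setdefault(key, []).append(value)

    new_state = {}
    visited = 0
    # Every key present in the old state or in the batch is visited,
    # even when the batch carries no new values for that key.
    for key in state.keys() | grouped.keys():
        visited += 1
        result = update_func(grouped.get(key, []), state.get(key))
        if result is not None:  # returning None drops the key from the state
            new_state[key] = result
    return new_state, visited


def running_sum(new_values, old_state):
    # Example update function: keep a running total per key.
    return (old_state or 0) + sum(new_values)


# 100,000 existing keys, but only three incoming records in this batch.
state = {"row-%d" % i: 1 for i in range(100000)}
batch = [("row-7", 5), ("row-7", 2), ("row-new", 1)]
state, visited = update_state_full_scan(state, batch, running_sum)
print(visited, "keys visited for", len(batch), "incoming records")
```

In this model the work per batch grows with the total number of keys rather
than with the size of the batch, which is why a multi-terabyte state looks
problematic without some indexed or incremental alternative.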
> Improve performance of DStream.updateStateByKey
> -----------------------------------------------
>
> Key: SPARK-2629
> URL: https://issues.apache.org/jira/browse/SPARK-2629
> Project: Spark
> Issue Type: Improvement
> Components: Streaming
> Reporter: Tathagata Das
> Assignee: Tathagata Das
>
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)