Hi all, 
Recently in our project we have needed to update an RDD with data regularly
received from a DStream, and I plan to use the "foreachRDD" API to achieve this:
var MyRDD = ...

dstream.foreachRDD { rdd =>
  MyRDD = MyRDD.join(rdd) ...
  ...
}

Is this usage correct? My concern is that, since I am repeatedly and endlessly
reassigning MyRDD in order to update it, its lineage may grow too long to
process when I want to query MyRDD later on (similar to
https://issues.apache.org/jira/browse/SPARK-4672).

Maybe I should:
1. cache or checkpoint the latest MyRDD and unpersist the old MyRDD every time a
new batch arrives (a rough sketch follows this list), or
2. use the unpublished IndexedRDD
(https://github.com/amplab/spark-indexedrdd) to perform efficient RDD
updates.
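
For option 1, here is a minimal sketch of what I have in mind, assuming sc (the
SparkContext) and a dstream of type DStream[(String, Int)] are already in scope;
the (String, Int) element type, the merge logic inside the join, the checkpoint
directory and the 10-batch checkpoint interval are only placeholders:

import org.apache.spark.rdd.RDD

sc.setCheckpointDir("/tmp/spark-checkpoints")    // placeholder directory

var MyRDD: RDD[(String, Int)] = sc.parallelize(Seq.empty[(String, Int)])
var batchCount = 0L

dstream.foreachRDD { rdd =>
  val updated = MyRDD.join(rdd)                  // merge logic is only a placeholder
    .mapValues { case (_, newV) => newV }
    .cache()                                     // keep only the latest state in memory

  batchCount += 1
  if (batchCount % 10 == 0) {
    updated.checkpoint()                         // periodically truncate the lineage
  }
  updated.count()                                // materialize (and checkpoint) before dropping old state
  MyRDD.unpersist(blocking = false)
  MyRDD = updated
}

The count() after checkpoint() is there to force the checkpoint to actually run
within the batch; without an action the lineage would not be truncated.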

As I lack experience with Spark Streaming and IndexedRDD, I want to make sure
my thoughts are on the right track. Your suggestions would be greatly
appreciated.



-----
Feel the sparking Spark!