Hi all,

Recently in our project we need to update an RDD using data regularly received from a DStream. I plan to use the "foreachRDD" API to achieve this:

    var MyRDD = ...
    dstream.foreachRDD { rdd =>
      MyRDD = MyRDD.join(rdd)......
      ...
    }
Is this usage correct? My concern is that, since I am repeatedly and endlessly reassigning MyRDD in order to update it, the RDD lineage may grow too long to process when I later want to query MyRDD (similar to https://issues.apache.org/jira/browse/SPARK-4672). Maybe I should:

1. cache or checkpoint the latest MyRDD and unpersist the old MyRDD every time a new batch arrives from the DStream (a rough sketch of what I mean is in the P.S. below), or
2. use the unpublished IndexedRDD (https://github.com/amplab/spark-indexedrdd) to perform efficient updates to the RDD.

As I lack experience with Spark Streaming and IndexedRDD, I want to make sure my thoughts are on the right track. Your suggestions will be greatly appreciated.

-----
Feel the sparking Spark!
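P.S. In case it helps, here is a rough, untested sketch of what I have in mind for option 1. The socket input source, the (String, Long) key/value types, the checkpoint path, and the every-10-batches interval are all just placeholders, not our real job:

    import org.apache.spark.SparkConf
    import org.apache.spark.rdd.RDD
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object UpdateRddFromDStream {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf().setAppName("UpdateRddFromDStream")
        val ssc = new StreamingContext(conf, Seconds(10))
        // Also sets the SparkContext checkpoint dir, needed for RDD.checkpoint() below.
        ssc.checkpoint("hdfs:///tmp/rdd-checkpoint")  // placeholder path

        // The RDD maintained across batches; (key, count) pairs here, just as an example.
        var myRDD: RDD[(String, Long)] = ssc.sparkContext.emptyRDD[(String, Long)]

        // Placeholder input stream of (key, count) pairs.
        val dstream = ssc.socketTextStream("localhost", 9999).map(line => (line, 1L))

        var batches = 0L
        // foreachRDD runs on the driver, so reassigning the vars here is fine.
        dstream.foreachRDD { rdd =>
          val previous = myRDD

          // Merge the new batch into the maintained RDD and cache the result.
          myRDD = previous
            .fullOuterJoin(rdd.reduceByKey(_ + _))
            .mapValues { case (old, add) => old.getOrElse(0L) + add.getOrElse(0L) }
            .cache()

          // Every N batches, checkpoint to truncate the lineage (to avoid
          // SPARK-4672-style problems).
          batches += 1
          if (batches % 10 == 0) {
            myRDD.checkpoint()
          }

          // Materialize the new RDD (and its checkpoint, if requested) before
          // releasing the previous version.
          myRDD.count()
          previous.unpersist(blocking = false)
        }

        ssc.start()
        ssc.awaitTermination()
      }
    }

The idea is that the periodic checkpoint() cuts the lineage, while cache()/unpersist() keeps only the latest version of the RDD in memory, but I am not sure whether this is the idiomatic way to do it.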