I totally agree that if we are able to do this, it will be very cool.
However, it requires having a common trait / interface between RDD and
DStream, which we don't have as of now. It would be very cool though. On my
wish list for sure.
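
In the meantime, a common workaround is to write the core logic once against
RDDs and reuse it on a stream with DStream.transform, which applies an
RDD-to-RDD function to each micro-batch. A minimal sketch (assuming a Tweet
case class like the one in the quoted example below; note this deduplicates
within each batch, not across the whole stream):

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.DStream

// Core logic, written once against RDDs.
def getUniqueTweets(tweets: RDD[Tweet]): RDD[String] =
  tweets.map(_.tweetText).distinct()

// Reused on a stream: transform applies the RDD function per micro-batch.
def getUniqueTweetStream(tweets: DStream[Tweet]): DStream[String] =
  tweets.transform(rdd => getUniqueTweets(rdd))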

TD


On Thu, Jul 10, 2014 at 11:53 AM, mshah <shahmaul...@gmail.com> wrote:

> I wanted to get a perspective on how to share code between Spark batch
> processing and Spark Streaming.
>
> For example, say I want to get the unique tweets stored in an HDFS file, in
> both Spark batch and Spark Streaming. Currently I have to do the following:
>
> case class Tweet(tweetText: String, userId: String)
>
> Spark Batch:
> val tweets = sparkContext.newAPIHadoopFile("tweet")
>
> def getUniqueTweets(tweets: RDD[Tweet]) = {
>   tweets.map(tweet => (tweet.tweetText, tweet))
>     .groupByKey()
>     .map { case (tweetText, _) => tweetText }
> }
>
> Spark Streaming:
>
> val tweets = streamingContext.fileStream("tweet")
>
> def getUniqueTweets(tweets: DStream[Tweet]) = {
>   tweets.map(tweet => (tweet.tweetText, tweet))
>     .groupByKey()
>     .map { case (tweetText, _) => tweetText }
> }
>
>
> The above example shows that I am doing the same thing in both cases, but I
> have to replicate the code because there is no common abstraction between
> DStream and RDD, or between SparkContext and StreamingContext.
>
> If there were a common abstraction it would be much simpler:
>
> val tweets = context.read("tweet", Stream or Batch)
>
> def getUniqueTweets(tweets: SparkObject[Tweet]) = {
>   tweets.map(tweet => (tweet.tweetText, tweet))
>     .groupByKey()
>     .map { case (tweetText, _) => tweetText }
> }
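>
> For illustration, one way to approximate such an abstraction in user code
> today would be a type class over the container. This is a hypothetical
> sketch, not an existing Spark API (DistCollection, forRdd, and forDStream
> are made-up names), and DStream's distinct is only emulated per micro-batch
> via transform:
>
> import scala.reflect.ClassTag
> import org.apache.spark.rdd.RDD
> import org.apache.spark.streaming.dstream.DStream
>
> // Hypothetical type class abstracting over RDD and DStream.
> trait DistCollection[C[_]] {
>   def map[A, B: ClassTag](c: C[A])(f: A => B): C[B]
>   def distinct[A: ClassTag](c: C[A]): C[A]
> }
>
> object DistCollection {
>   implicit val forRdd: DistCollection[RDD] = new DistCollection[RDD] {
>     def map[A, B: ClassTag](c: RDD[A])(f: A => B): RDD[B] = c.map(f)
>     def distinct[A: ClassTag](c: RDD[A]): RDD[A] = c.distinct()
>   }
>   implicit val forDStream: DistCollection[DStream] = new DistCollection[DStream] {
>     def map[A, B: ClassTag](c: DStream[A])(f: A => B): DStream[B] = c.map(f)
>     // Deduplicates within each micro-batch, not across the stream.
>     def distinct[A: ClassTag](c: DStream[A]): DStream[A] = c.transform(_.distinct())
>   }
> }
>
> // The shared logic, written once for both batch and streaming:
> def getUniqueTweets[C[_]](tweets: C[Tweet])(implicit dc: DistCollection[C]): C[String] =
>   dc.distinct(dc.map(tweets)(_.tweetText))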
>
> I would appreciate thoughts on this. Is it already available? Is there any
> plan to add this support? Or is it intentionally not supported as a design
> choice?
