Re: streaming questions

Tathagata Das Wed, 26 Mar 2014 11:33:55 -0700

*Answer 1:* Make sure you are using master as "local[n]" with n > 1
(assuming you are running it in local mode). The way Spark Streaming
works is that it assigns a core to the data receiver, so if you
run the program with only one core (i.e., with local or local[1]),
it won't have the resources to process data while also receiving it.
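
A minimal sketch of that setup, in case it helps. The guard function
isSingleThreadedLocal, the object name, and the host/port are illustrative
assumptions, not part of Spark or of your program:

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

object LocalMasterSketch {
  // Hypothetical helper: true when a local master string provides only one
  // thread, which the receiver would occupy entirely.
  def isSingleThreadedLocal(master: String): Boolean =
    master == "local" || master == "local[1]"

  def main(args: Array[String]): Unit = {
    val master = if (args.nonEmpty) args(0) else "local[2]"
    require(!isSingleThreadedLocal(master),
      "Use local[n] with n > 1 so batches can be processed while data is received")

    // With local[2], one thread runs the socket receiver and one processes batches.
    val ssc = new StreamingContext(master, "Streaming Word Count", Seconds(2))
    val lines = ssc.socketTextStream("localhost", 9999) // assumed host and port
    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

The check only covers the two single-threaded spellings mentioned above;
cluster master URLs pass through untouched.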



*Answer 2:* Spark Streaming is designed to replicate the received data
across the machines in a Spark cluster for fault tolerance. However, when
you are running in local mode there is only one machine, so the "blocks"
of data can't be replicated. This warning is expected and safe to
ignore in local mode.
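
For reference, the replication comes from the receiver's storage level:
socketTextStream defaults to StorageLevel.MEMORY_AND_DISK_SER_2, which keeps
two copies of each block. The sketch below passes a single-copy level
instead. Whether this silences the warning in local mode is an assumption I
have not verified, and the object name, host, and port are made up for
illustration:

```scala
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

object UnreplicatedReceiverSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "Streaming Word Count", Seconds(2))

    // The third argument overrides the default MEMORY_AND_DISK_SER_2,
    // so received blocks are stored once rather than replicated.
    val lines = ssc.socketTextStream(
      "localhost", 9999, // assumed host and port
      StorageLevel.MEMORY_AND_DISK_SER)

    lines.print()
    ssc.start()
    ssc.awaitTermination()
  }
}
```

On a real cluster you would normally keep the replicated default, since that
is what provides the fault tolerance.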


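*Answer 3:* You can do something like

import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Time

wordCounts.foreachRDD((rdd: RDD[(String, Int)], time: Time) => {
   if (rdd.take(1).size == 1) {
      // There exists at least one element in the RDD, so save it to a file
      // named after the batch time (same path prefix as in your question)
      rdd.saveAsTextFile("file:/my/path/outdir-" + time.milliseconds)
   }
})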


TD



On Wed, Mar 26, 2014 at 11:08 AM, Diana Carroll <dcarr...@cloudera.com>
wrote:
> I'm trying to understand Spark streaming, hoping someone can help.
>
> I've kinda-sorta got a version of Word Count running, and it looks like
> this:
>
> import org.apache.spark.streaming.{Seconds, StreamingContext}
> import org.apache.spark.streaming.StreamingContext._
>
> object StreamingWordCount {
>
>   def main(args: Array[String]) {
>     if (args.length < 3) {
>       System.err.println("Usage: StreamingWordCount <master> <hostname> <port>")
>       System.exit(1)
>     }
>
>     val master = args(0)
>     val hostname = args(1)
>     val port = args(2).toInt
>
>     val ssc = new StreamingContext(master, "Streaming Word Count", Seconds(2))
>     val lines = ssc.socketTextStream(hostname, port)
>     val words = lines.flatMap(line => line.split(" "))
>     val wordCounts = words.map(x => (x, 1)).reduceByKey((x,y) => x+y)
>     wordCounts.print()
>     ssc.start()
>     ssc.awaitTermination()
>  }
> }
>
> (I also have a small script that sends text to that port.)
>
> Question 1:
> When I run this, I don't get any output from the wordCounts.print as long as
> my data is still streaming.  I have to stop my streaming data script before
> my program will display the word counts.
>
> Why is that?  What if my stream is indefinite?  I thought the point of
> Streaming was that it would process it in real time?
>
> Question 2:
> While I run this (and the stream is still sending) I get continuous warning
> messages like this:
> 14/03/26 10:57:03 WARN BlockManager: Block input-0-1395856623200 already exists on this machine; not re-adding it
> 14/03/26 10:57:03 WARN BlockManager: Block input-0-1395856623400 already exists on this machine; not re-adding it
>
> What does that mean?
>
> Question 3:
> I tried replacing the wordCounts.print() line with
> wordCounts.saveAsTextFiles("file:/my/path/outdir").
> This results in a new outdir-timestamp file being created
> every two seconds...even if there's no data during that time period.  Is
> there a way to tell it to save only if there's data?
>
> Thanks!
