Repository: spark
Updated Branches:
  refs/heads/branch-0.9 1cc979e0a -> 1e366904c


[SPARK-1285] Backporting updates to streaming docs to branch 0.9

Cherry-picked updates that have been added to the master branch

Author: Aaron Kimball <aa...@magnify.io>
Author: Tathagata Das <tathagata.das1...@gmail.com>
Author: Chen Chao <crazy...@gmail.com>
Author: Andrew Or <andrewo...@gmail.com>

Closes #183 from tdas/branch-0.9-streaming-docs and squashes the following commits:

e1a988f [Tathagata Das] Added clean to run-tests
98c3e98 [Tathagata Das] Merge remote-tracking branch 'apache-github/branch-0.9' into branch-0.9-streaming-docs
d792351 [Chen Chao] maintain arbitrary state data for each key
e708f74 [Aaron Kimball] SPARK-1173. (#2) Fix typo in Java streaming example.
156bcd7 [Aaron Kimball] SPARK-1173. Improve scala streaming docs.
8849a96 [Andrew Or] Fix typos in Spark Streaming programming guide
fbd66a5 [Chen Chao] Merge pull request #579 from CrazyJvm/patch-1.


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/1e366904
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/1e366904
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/1e366904

Branch: refs/heads/branch-0.9
Commit: 1e366904cf79f114b9383d898c392aea5747823a
Parents: 1cc979e
Author: Aaron Kimball <aa...@magnify.io>
Authored: Thu Mar 20 12:27:47 2014 -0700
Committer: Tathagata Das <tathagata.das1...@gmail.com>
Committed: Thu Mar 20 12:27:47 2014 -0700

----------------------------------------------------------------------
 dev/run-tests                       |  2 +-
 docs/streaming-programming-guide.md | 71 ++++++++++++++++++++++----------
 2 files changed, 50 insertions(+), 23 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/1e366904/dev/run-tests
----------------------------------------------------------------------
diff --git a/dev/run-tests b/dev/run-tests
index a5dcacb..0003871 100755
--- a/dev/run-tests
+++ b/dev/run-tests
@@ -30,7 +30,7 @@ set -e
 echo "========================================================================="
 echo "Running Spark unit tests"
 echo "========================================================================="
-sbt/sbt assembly test
+sbt/sbt clean assembly test
 
 echo "========================================================================="
 echo "Running PySpark tests"

http://git-wip-us.apache.org/repos/asf/spark/blob/1e366904/docs/streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/streaming-programming-guide.md b/docs/streaming-programming-guide.md
index 4985c52..f9904d4 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -58,11 +58,21 @@ do is as follows.
 
 <div class="codetabs">
 <div data-lang="scala"  markdown="1" >
+First, we import the names of the Spark Streaming classes, and some implicit
+conversions from StreamingContext into our environment, to add useful methods to
+other classes we need (like DStream).
 
-First, we create a
-[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) object,
-which is the main entry point for all streaming
-functionality. Besides Spark's configuration, we specify that any DStream will be processed
+[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) is the
+main entry point for all streaming functionality.
+
+{% highlight scala %}
+import org.apache.spark.streaming._
+import org.apache.spark.streaming.StreamingContext._
+{% endhighlight %}
+
+Then we create a
+[StreamingContext](api/streaming/index.html#org.apache.spark.streaming.StreamingContext) object.
+Besides Spark's configuration, we specify that any DStream will be processed
 in 1 second batches.
 
 {% highlight scala %}
@@ -88,7 +98,7 @@ val words = lines.flatMap(_.split(" "))
 {% endhighlight %}
 
 `flatMap` is a one-to-many DStream operation that creates a new DStream by
-generating multiple new records from each record int the source DStream. In this case,
+generating multiple new records from each record in the source DStream. In this case,
 each line will be split into multiple words and the stream of words is represented as the
 `words` DStream.  Next, we want to count these words.
 
@@ -98,7 +108,7 @@ val pairs = words.map(word => (word, 1))
 val wordCounts = pairs.reduceByKey(_ + _)
 
 // Print a few of the counts to the console
-wordCount.print()
+wordCounts.print()
 {% endhighlight %}
 
 The `words` DStream is further mapped (one-to-one transformation) to a DStream of `(word,
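
For context, a minimal sketch of the complete Scala word-count pipeline that these hunks edit, assuming the 0.9 streaming API; the master string, application name, host, and port are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.StreamingContext._

    // Local StreamingContext with two worker threads and a 1 second batch interval
    val ssc = new StreamingContext("local[2]", "NetworkWordCount", Seconds(1))

    // Lines received from a netcat server listening on localhost:9999
    val lines = ssc.socketTextStream("localhost", 9999)

    // Split each line into words, count each word per batch, and print a sample
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
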
@@ -178,7 +188,7 @@ JavaPairDStream<String, Integer> wordCounts = pairs.reduceByKey(
       return i1 + i2;
     }
   });
-wordCount.print();     // Print a few of the counts to the console
+wordCounts.print();     // Print a few of the counts to the console
 {% endhighlight %}
 
 The `words` DStream is further mapped (one-to-one transformation) to a DStream of `(word,
@@ -262,6 +272,24 @@ Time: 1357008430000 ms
     </td>
 </table>
 
+If you plan to run the Scala code for Spark Streaming-based use cases in the Spark
+shell, you should start the shell with the SparkConfiguration pre-configured to
+discard old batches periodically:
+
+{% highlight bash %}
+$ SPARK_JAVA_OPTS=-Dspark.cleaner.ttl=10000 bin/spark-shell
+{% endhighlight %}
+
+... and create your StreamingContext by wrapping the existing interactive shell
+SparkContext object, `sc`:
+
+{% highlight scala %}
+val ssc = new StreamingContext(sc, Seconds(1))
+{% endhighlight %}
+
+When working with the shell, you may also need to send a `^D` to your netcat session
+to force the pipeline to print the word counts to the console at the sink.
+
 ***************************************************************************************************
 
 # Basics
@@ -428,9 +456,9 @@ KafkaUtils.createStream(javaStreamingContext, kafkaParams, ...);
 </div>
 </div>
 
-For more details on these additional sources, see the corresponding [API documentation]
-(#where-to-go-from-here). Furthermore, you can also implement your own custom receiver
-for your sources. See the [Custom Receiver Guide](streaming-custom-receivers.html).
+For more details on these additional sources, see the corresponding [API documentation](#where-to-go-from-here).
+Furthermore, you can also implement your own custom receiver for your sources. See the
+[Custom Receiver Guide](streaming-custom-receivers.html).
 
 ## Operations
 There are two kinds of DStream operations - _transformations_ and _output operations_. Similar to
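
As an illustration of the additional sources mentioned in the hunk above, a sketch assuming the 0.9 Kafka connector; the ZooKeeper quorum, consumer group, and topic map are placeholders, and `ssc` is the StreamingContext from the earlier sketch:

    import org.apache.spark.streaming.kafka.KafkaUtils

    // Each entry in the topic map is (topic name -> number of receiver threads).
    val kafkaStream = KafkaUtils.createStream(ssc, "zk1:2181", "my-consumer-group", Map("pageviews" -> 1))
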
@@ -511,7 +539,7 @@ common ones are as follows.
   <td> <b>updateStateByKey</b>(<i>func</i>) </td>
  <td> Return a new "state" DStream where the state for each key is updated by applying the
  given function on the previous state of the key and the new values for the key. This can be
-  used to maintain arbitrary state data for each ket.</td>
+  used to maintain arbitrary state data for each key.</td>
 </tr>
 <tr><td></td><td></td></tr>
 </table>
@@ -520,9 +548,8 @@ The last two transformations are worth highlighting again.
 
 <h4>UpdateStateByKey Operation</h4>
 
-The `updateStateByKey` operation allows
-you to main arbitrary stateful computation, where you want to maintain some state data and
-continuously update it with new information. To use this, you will have to do two steps.
+The `updateStateByKey` operation allows you to maintain arbitrary state while continuously updating
+it with new information. To use this, you will have to do two steps.
 
 1. Define the state - The state can be of arbitrary data type.
 1. Define the state update function - Specify with a function how to update the state using the
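
As an illustration of those two steps, a hedged sketch of a running word count kept with `updateStateByKey` under the 0.9 API; `pairs` is the `(word, 1)` DStream from the earlier sketch and the function name is illustrative:

    // Step 1: the state is a running count (Int) per word.
    // Step 2: the update function merges the new values for a key with its previous state.
    def updateFunction(newValues: Seq[Int], runningCount: Option[Int]): Option[Int] = {
      Some(runningCount.getOrElse(0) + newValues.sum)
    }

    // Requires a checkpoint directory to be set on the StreamingContext beforehand.
    val runningCounts = pairs.updateStateByKey[Int](updateFunction _)
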
@@ -925,7 +952,7 @@ exception saying so.
 ## Monitoring
 Besides Spark's in-built [monitoring capabilities](monitoring.html),
 the progress of a Spark Streaming program can also be monitored using the [StreamingListener]
-(streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
+(api/streaming/index.html#org.apache.spark.scheduler.StreamingListener) interface,
 which allows you to get statistics of batch processing times, queueing delays,
 and total end-to-end delays. Note that this is still an experimental API and it is likely to be
 improved upon (i.e., more information reported) in the future.
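
For readers who want to try this, a hedged sketch of wiring up such a listener; the method and field names follow the experimental listener API and are assumptions that may differ slightly across releases:

    import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

    // Print per-batch delay statistics as each batch finishes.
    ssc.addStreamingListener(new StreamingListener {
      override def onBatchCompleted(batchCompleted: StreamingListenerBatchCompleted) {
        val info = batchCompleted.batchInfo
        println("processing delay: " + info.processingDelay +
          ", scheduling delay: " + info.schedulingDelay +
          ", total delay: " + info.totalDelay)
      }
    })
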
@@ -1000,11 +1027,11 @@ Since all data is modeled as RDDs with their lineage of deterministic operations
  for output operations.
 
 ## Failure of the Driver Node
-To allows a streaming application to operate 24/7, Spark Streaming allows a streaming computation
+For a streaming application to operate 24/7, Spark Streaming allows a streaming computation
 to be resumed even after the failure of the driver node. Spark Streaming periodically writes the
 metadata information of the DStreams setup through the `StreamingContext` to a
 HDFS directory (can be any Hadoop-compatible filesystem). This periodic
-*checkpointing* can be enabled by setting a the checkpoint
+*checkpointing* can be enabled by setting the checkpoint
 directory using `ssc.checkpoint(<checkpoint directory>)` as described
 [earlier](#rdd-checkpointing). On failure of the driver node,
 the lost `StreamingContext` can be recovered from this information, and restarted.
@@ -1105,8 +1132,8 @@ classes. So, if you are using `getOrCreate`, then make sure that the checkpoint
 explicitly deleted every time recompiled code needs to be launched.
 
 This failure recovery can be done automatically using Spark's
-[standalone cluster mode](spark-standalone.html), which allows any Spark
-application's driver to be as well as, ensures automatic restart of the driver on failure (see
+[standalone cluster mode](spark-standalone.html), which allows the driver of any Spark application
+to be launched within the cluster and be restarted on failure (see
 [supervise mode](spark-standalone.html#launching-applications-inside-the-cluster)). This can be
 tested locally by launching the above example using the supervise mode in a
 local standalone cluster and killing the java process running the driver (will be shown as
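
For reference, a hedged Scala sketch of the checkpoint-plus-`getOrCreate` recovery pattern described in this hunk; the checkpoint path, master string, and setup function are illustrative:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Called only when no checkpoint exists; otherwise the context (and the DStream
    // setup recorded in the checkpoint) is reconstructed from the checkpoint data.
    def createContext(): StreamingContext = {
      val ssc = new StreamingContext("local[2]", "RecoverableApp", Seconds(1))
      ssc.checkpoint("hdfs://namenode:8020/user/spark/checkpoints")  // illustrative path
      // ... define the DStream operations here ...
      ssc
    }

    val ssc = StreamingContext.getOrCreate(
      "hdfs://namenode:8020/user/spark/checkpoints", createContext _)
    ssc.start()
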
@@ -1123,7 +1150,7 @@ There are two different failure behaviors based on which input sources are used.
 1. _Using HDFS files as input source_ - Since the data is reliably stored on HDFS, all data can
 re-computed and therefore no data will be lost due to any failure.
 1. _Using any input source that receives data through a network_ - The received input data is
-replicated in memory to multiple nodes. Since, all the data in the Spark worker's memory is lost
+replicated in memory to multiple nodes. Since all the data in the Spark worker's memory is lost
 when the Spark driver fails, the past input data will not be accessible and driver recovers.
 Hence, if stateful and window-based operations are used
 (like `updateStateByKey`, `window`, `countByValueAndWindow`, etc.), then the intermediate state
@@ -1133,11 +1160,11 @@ In future releases, we will support full recoverability for all input sources. N
 non-stateful transformations like `map`, `count`, and `reduceByKey`, with _all_ input streams,
 the system, upon restarting, will continue to receive and process new data.
 
-To better understand the behavior of the system under driver failure with a HDFS source, lets
+To better understand the behavior of the system under driver failure with a HDFS source, let's
 consider what will happen with a file input stream. Specifically, in the case of the file input
 stream, it will correctly identify new files that were created while the driver was down and
 process them in the same way as it would have if the driver had not failed. To explain further
-in the case of file input stream, we shall use an example. Lets say, files are being generated
+in the case of file input stream, we shall use an example. Let's say, files are being generated
 every second, and a Spark Streaming program reads every new file and output the number of lines
 in the file. This is what the sequence of outputs would be with and without a driver failure.
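
To make the file-input example above concrete, a minimal sketch of a program that prints the number of lines arriving in each batch; the watched directory is illustrative and `ssc` comes from the earlier sketches:

    // Watch an HDFS directory for newly created files and count their lines per batch.
    val fileLines = ssc.textFileStream("hdfs://namenode:8020/user/spark/incoming")
    fileLines.count().print()
    ssc.start()
    ssc.awaitTermination()
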
 
