Repository: spark
Updated Branches:
  refs/heads/branch-2.0 2c1aae442 -> 73bf87f3c


[SPARK-7848][STREAMING][UPDATE SPARKSTREAMING DOCS TO INCORPORATE IMPORTANT 
POINTS.]

Updated the Spark Streaming docs with some important points.

Author: Nirman Narang <[email protected]>

Closes #11114 from nirmannarang/SPARK-7848.

(cherry picked from commit 04d7b3d2b6b9953de399fd743e596c310234042f)
Signed-off-by: Reynold Xin <[email protected]>


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/73bf87f3
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/73bf87f3
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/73bf87f3

Branch: refs/heads/branch-2.0
Commit: 73bf87f3c80a0935f67e2b20f7a9b1a6766ec227
Parents: 2c1aae4
Author: Nirman Narang <[email protected]>
Authored: Wed Jun 15 15:36:31 2016 -0700
Committer: Reynold Xin <[email protected]>
Committed: Wed Jun 15 15:36:36 2016 -0700

----------------------------------------------------------------------
 docs/streaming-programming-guide.md | 19 +++++++++++++++++++
 1 file changed, 19 insertions(+)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/73bf87f3/docs/streaming-programming-guide.md
----------------------------------------------------------------------
diff --git a/docs/streaming-programming-guide.md 
b/docs/streaming-programming-guide.md
index 4ea3b60..db06a65 100644
--- a/docs/streaming-programming-guide.md
+++ b/docs/streaming-programming-guide.md
@@ -2181,6 +2181,25 @@ consistent batch processing times. Make sure you set the 
CMS GC on both the driv
     - Persist RDDs using the `OFF_HEAP` storage level. See more detail in the 
[Spark Programming Guide](programming-guide.html#rdd-persistence).
     - Use more executors with smaller heap sizes. This will reduce the GC 
pressure within each JVM heap.
 
+***
+
+##### Important points to remember:
+{:.no_toc}
+- A DStream is associated with a single receiver. To attain read parallelism, multiple receivers, i.e. multiple DStreams, need to be created. A receiver runs within an executor and occupies one core. Ensure that enough cores are left for processing after the receiver slots are booked, i.e. `spark.cores.max` should take the receiver cores into account (see the first sketch after this list). The receivers are allocated to executors in a round-robin fashion.
+
+- When data is received from a stream source, the receiver creates blocks of data. A new block of data is generated every `blockInterval` milliseconds, so N blocks are created during a `batchInterval`, where N = `batchInterval` / `blockInterval`. These blocks are distributed by the BlockManager of the receiver's executor to the block managers of other executors. After that, the Network Input Tracker running on the driver is informed about the block locations for further processing. A worked example of this arithmetic follows the list.
+
+- An RDD is created on the driver for the blocks generated during each `batchInterval`. The blocks generated during the `batchInterval` are the partitions of the RDD, and each partition is a task in Spark. `blockInterval` == `batchInterval` would mean that a single partition is created and it is probably processed locally.
+
+- The map tasks on the blocks are processed in the executors that have the blocks (the one that received the block and the one where the block was replicated), irrespective of the block interval, unless non-local scheduling kicks in. A bigger `blockInterval` means bigger blocks, and a high value of `spark.locality.wait` increases the chance of a block being processed on a node that holds it locally. A balance needs to be found between these two parameters to ensure that the bigger blocks are processed locally (the second sketch after this list shows both settings).
+
+- Instead of relying on `batchInterval` and `blockInterval`, you can define the number of partitions by calling `inputDstream.repartition(n)`. This reshuffles the data in the RDD randomly to create `n` partitions, giving greater parallelism at the cost of a shuffle (see the third sketch after this list). An RDD's processing is scheduled by the driver's JobScheduler as a job; at any given point in time only one job is active, so if one job is executing, the other jobs are queued.
+
+- If you have two DStreams, two RDDs are formed per batch and two jobs are created, which are scheduled one after the other. To avoid this, you can union the two DStreams. This ensures that a single UnionRDD is formed for the two RDDs of the DStreams, and that UnionRDD is then processed as a single job. However, the partitioning of the RDDs is not affected. (The union is also shown in the third sketch after this list.)
+
+- If the batch processing time is more than the `batchInterval`, then obviously the receiver's memory will start filling up and will eventually end up throwing exceptions (most probably a BlockNotFoundException). Currently there is no way to pause the receiver, but the receiving rate can be limited with the SparkConf setting `spark.streaming.receiver.maxRate` (see the final sketch after this list).
+
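+A minimal sketch of the receiver/core accounting described above. The application name, hostnames, ports and core counts are illustrative values, not recommendations.
+
+{% highlight scala %}
+import org.apache.spark.SparkConf
+import org.apache.spark.streaming.{Seconds, StreamingContext}
+
+// Illustrative: cap the application at 4 cores in total.
+val conf = new SparkConf()
+  .setAppName("ReceiverCoreAccounting")
+  .set("spark.cores.max", "4")
+
+val ssc = new StreamingContext(conf, Seconds(2))
+
+// Two input DStreams => two receivers, each permanently occupying one
+// executor core (receivers are placed on executors in a round-robin fashion).
+val lines1 = ssc.socketTextStream("host1", 9999)
+val lines2 = ssc.socketTextStream("host2", 9999)
+
+// Only 4 - 2 = 2 cores are left for processing the received batches.
+{% endhighlight %}
+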
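+A sketch of the block/partition arithmetic and the block-size versus locality trade-off. The 2 second batch interval, 200 ms block interval and 500 ms locality wait are example values only.
+
+{% highlight scala %}
+// Uses the same imports as the previous sketch.
+val conf = new SparkConf()
+  .setAppName("BlockIntervalSketch")
+  .set("spark.streaming.blockInterval", "200ms") // one block every 200 ms
+  .set("spark.locality.wait", "500ms")           // wait longer for node-local tasks
+
+val ssc = new StreamingContext(conf, Seconds(2)) // one batch every 2 seconds
+
+// Blocks (= RDD partitions = tasks) per batch per receiver:
+//   batchInterval / blockInterval = 2000 ms / 200 ms = 10
+{% endhighlight %}
+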
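+A sketch of unioning the per-receiver DStreams and then choosing the partition count explicitly. It continues from the `ssc`, `lines1` and `lines2` of the first sketch; the partition count of 10 is arbitrary.
+
+{% highlight scala %}
+// One unioned DStream => one job per batch instead of one job per stream.
+val combined = ssc.union(Seq(lines1, lines2))
+
+// Explicitly pick the level of parallelism (at the cost of a shuffle).
+val repartitioned = combined.repartition(10)
+
+// An output operation is still needed to trigger execution.
+repartitioned.count().print()
+{% endhighlight %}
+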
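+Finally, a sketch of capping the receiver ingest rate. The limit of 1000 records per second per receiver is an arbitrary example value.
+
+{% highlight scala %}
+// Each receiver will accept at most 1000 records per second.
+val conf = new SparkConf()
+  .setAppName("ReceiverRateLimit")
+  .set("spark.streaming.receiver.maxRate", "1000")
+{% endhighlight %}
+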
 
 
***************************************************************************************************
 
***************************************************************************************************

