Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4956#discussion_r26089012
  
    --- Diff: docs/streaming-programming-guide.md ---
    @@ -1801,40 +1894,40 @@ temporary data rate increases maybe fine as long as the delay reduces back to a
     
     ## Memory Tuning
     Tuning the memory usage and GC behavior of Spark applications have been discussed in great detail
    -in the [Tuning Guide](tuning.html). It is recommended that you read that. In this section,
    -we highlight a few customizations that are strongly recommended to minimize GC related pauses
    -in Spark Streaming applications and achieving more consistent batch processing times.
    -
    -* **Default persistence level of DStreams**: Unlike RDDs, the default persistence level of DStreams
    -serializes the data in memory (that is,
    -[StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for
    -DStream compared to
    -[StorageLevel.MEMORY_ONLY](api/scala/index.html#org.apache.spark.storage.StorageLevel$) for RDDs).
    -Even though keeping the data serialized incurs higher serialization/deserialization overheads,
    -it significantly reduces GC pauses.
    -
    -* **Clearing persistent RDDs**: By default, all persistent RDDs generated by Spark Streaming will
    - be cleared from memory based on Spark's built-in policy (LRU). If `spark.cleaner.ttl` is set,
    - then persistent RDDs that are older than that value are periodically cleared. As mentioned
    - [earlier](#operation), this needs to be careful set based on operations used in the Spark
    - Streaming program. However, a smarter unpersisting of RDDs can be enabled by setting the
    - [configuration property](configuration.html#spark-properties) `spark.streaming.unpersist` to
    - `true`. This makes the system to figure out which RDDs are not necessary to be kept around and
    - unpersists them. This is likely to reduce
    - the RDD memory usage of Spark, potentially improving GC behavior as well.
    -
    -* **Concurrent garbage collector**: Using the concurrent mark-and-sweep GC further
    -minimizes the variability of GC pauses. Even though concurrent GC is known to reduce the
    +in the [Tuning Guide](tuning.html#memory-tuning). It is strongly recommended that you read that. In this section, we discuss a few tuning parameters specifically in the context of Spark Streaming applications.
    +
    +The amount of cluster memory required by a Spark Streaming application depends heavily on the type of transformations used. For example, if you want to use a window operation on the last 10 minutes of data, then your cluster should have sufficient memory to hold 10 minutes' worth of data in memory. Or if you want to use `updateStateByKey` with a large number of keys, then the necessary memory will be high. In contrast, if you want to do a simple map-filter-store operation, then the necessary memory will be low.
    +
    +In general, since the data received through receivers is stored with StorageLevel.MEMORY_AND_DISK_SER_2, the data that does not fit in memory will spill over to disk. This may reduce the performance of the streaming application, and hence it is advised to provide sufficient memory as required by your streaming application. It's best to measure the memory usage at a small scale and estimate accordingly.
    +
    +Another aspect of memory tuning is garbage collection. For a streaming application that requires low latency, it is undesirable to have large pauses caused by JVM garbage collection.
    +
    +There are a few parameters that can help you tune the memory usage and GC overheads.
    +
    +* **Persistence Level of DStreams**: As mentioned earlier in the [Data Serialization](#data-serialization) section, the input data and RDDs are by default persisted as serialized bytes. This reduces both the memory usage and the GC overheads, compared to deserialized persistence. Enabling Kryo serialization further reduces serialized sizes and memory usage. Further reduction in memory usage through compression (see the Spark configuration `spark.rdd.compress`) comes at the cost of CPU time.
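For instance, the two settings mentioned above could be enabled in `conf/spark-defaults.conf` like this (both are real Spark configuration keys; whether they pay off depends on your workload):

```properties
# Use Kryo for smaller serialized sizes
spark.serializer        org.apache.spark.serializer.KryoSerializer
# Trade CPU time for further memory savings on serialized RDDs
spark.rdd.compress      true
```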
    +
    +* **Clearing old data**: By default, all input data and persisted RDDs generated by DStream transformations are automatically cleared. Spark Streaming decides when to clear the data based on the transformations that are used. For example, if you are using a window operation of 10 minutes, then Spark Streaming will keep around the last 10 minutes of data, and actively throw away older data.
    +Data can be retained for a longer duration (e.g. for interactively querying older data) by calling `streamingContext.remember`.
    +
    +* **CMS Garbage Collector**: Use of the concurrent mark-and-sweep GC is strongly recommended for keeping GC-related pauses consistently low. Even though concurrent GC is known to reduce the
     overall processing throughput of the system, its use is still recommended to achieve more
    -consistent batch processing times.
    +consistent batch processing times. Make sure you set the CMS GC on both the driver (using `--driver-java-options` in `spark-submit`) and the executors (using [Spark configuration](configuration.html#runtime-environment) `spark.executor.extraJavaOptions`).
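A `spark-submit` invocation with CMS enabled on both sides might look like the following; the flags and the configuration key are real, but the jar name is a placeholder:

```bash
spark-submit \
  --driver-java-options "-XX:+UseConcMarkSweepGC" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  my-streaming-app.jar
```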
    --- End diff ---
    
    Will do if I can test it out once.

