Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/4956#discussion_r26088513
--- Diff: docs/streaming-programming-guide.md ---
@@ -1801,40 +1894,40 @@ temporary data rate increases maybe fine as long as the delay reduces back to a
## Memory Tuning
Tuning the memory usage and GC behavior of Spark applications have been
discussed in great detail
-in the [Tuning Guide](tuning.html). It is recommended that you read that.
In this section,
-we highlight a few customizations that are strongly recommended to
minimize GC related pauses
-in Spark Streaming applications and achieving more consistent batch
processing times.
-
-* **Default persistence level of DStreams**: Unlike RDDs, the default
persistence level of DStreams
-serializes the data in memory (that is,
-[StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$)
for
-DStream compared to
-[StorageLevel.MEMORY_ONLY](api/scala/index.html#org.apache.spark.storage.StorageLevel$)
for RDDs).
-Even though keeping the data serialized incurs higher
serialization/deserialization overheads,
-it significantly reduces GC pauses.
-
-* **Clearing persistent RDDs**: By default, all persistent RDDs generated
by Spark Streaming will
- be cleared from memory based on Spark's built-in policy (LRU). If
`spark.cleaner.ttl` is set,
- then persistent RDDs that are older than that value are periodically
cleared. As mentioned
- [earlier](#operation), this needs to be careful set based on operations
used in the Spark
- Streaming program. However, a smarter unpersisting of RDDs can be enabled
by setting the
- [configuration property](configuration.html#spark-properties)
`spark.streaming.unpersist` to
- `true`. This makes the system to figure out which RDDs are not necessary
to be kept around and
- unpersists them. This is likely to reduce
- the RDD memory usage of Spark, potentially improving GC behavior as well.
-
-* **Concurrent garbage collector**: Using the concurrent mark-and-sweep GC
further
-minimizes the variability of GC pauses. Even though concurrent GC is known
to reduce the
+in the [Tuning Guide](tuning.html#memory-tuning). We strongly
recommend that you read that guide. In this section, we discuss a few tuning
parameters specifically in the context of Spark Streaming applications.
+
+The amount of cluster memory required by a Spark Streaming application
depends heavily on the type of transformations used. For example, if you want
to use a window operation on the last 10 minutes of data, then your cluster
should have sufficient memory to hold 10 minutes' worth of data in memory. Or
if you want to use `updateStateByKey` with a large number of keys, then the
necessary memory will be high. Conversely, if you want to do a simple
map-filter-store operation, then the necessary memory will be low.
+
+In general, since the data received through receivers is stored with
StorageLevel.MEMORY_AND_DISK_SER_2, the data that does not fit in memory will
spill over to the disk. This may reduce the performance of the streaming
application, and hence it is advised to provide sufficient memory as required
by your streaming application. It is best to try and see the memory usage on a
small scale and estimate accordingly.
+
+Another aspect of memory tuning is garbage collection. For a streaming
application that requires low latency, it is undesirable to have large pauses
caused by JVM garbage collection.
+
+There are a few parameters that can help you tune the memory usage and GC
overheads.
+
+* **Persistence Level of DStreams**: As mentioned earlier in the [Data
Serialization](#data-serialization) section, the input data and RDDs are by
default persisted as serialized bytes. This reduces both the memory usage and
the GC overheads, compared to deserialized persistence. Enabling Kryo
serialization further reduces the serialized sizes and memory usage. Further
reduction in memory usage through compression (see the Spark configuration
`spark.rdd.compress`) can come at the cost of CPU time.
+
+* **Clearing old data**: By default, all input data and persisted RDDs
generated by DStream transformations are automatically cleared. Spark Streaming
decides when to clear the data based on the transformations that are used. For
example, if you are using a window operation of 10 minutes, then Spark Streaming
will keep around the last 10 minutes of data, and actively throw away older data.
+Data can be retained for a longer duration (e.g. for interactively querying
older data) by setting `streamingContext.remember`.
+
+* **CMS Garbage Collector**: Use of the concurrent mark-and-sweep GC is
strongly recommended for keeping GC-related pauses consistently low. Even
though concurrent GC is known to reduce the
overall processing throughput of the system, its use is still recommended
to achieve more
-consistent batch processing times.
+consistent batch processing times. Make sure you set the CMS GC on both
the driver (using `--driver-java-options` in `spark-submit`) and the executors
(using [Spark configuration](configuration.html#runtime-environment)
`spark.executor.extraJavaOptions`).
--- End diff --
Maybe we could include an actual example value of these properties that
shows the exact JVM flag to use?
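For instance, the CMS collector is enabled with the JVM flag
`-XX:+UseConcMarkSweepGC`. A sketch of how that might look when submitting
(the application class and jar names below are only placeholders):

```shell
# Enable the CMS GC on both the driver and the executors at submit time.
# MyStreamingApp and my-streaming-app.jar are hypothetical placeholders.
./bin/spark-submit \
  --class MyStreamingApp \
  --driver-java-options "-XX:+UseConcMarkSweepGC" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC" \
  my-streaming-app.jar
```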
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]