Github user JoshRosen commented on a diff in the pull request:
https://github.com/apache/spark/pull/4956#discussion_r26089574
--- Diff: docs/streaming-programming-guide.md ---
@@ -1740,16 +1834,15 @@ documentation), or set the `spark.default.parallelism`
### Data Serialization
{:.no_toc}
-The overhead of data serialization can be significant, especially when sub-second batch sizes are
- to be achieved. There are two aspects to it.
+The overheads of data serialization can be reduced by tuning the serialization formats. In the case of streaming, there are two types of data that get serialized.
-* **Serialization of RDD data in Spark**: Please refer to the detailed discussion on data
-  serialization in the [Tuning Guide](tuning.html). However, note that unlike Spark, by default
-  RDDs are persisted as serialized byte arrays to minimize pauses related to GC.
+* **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format.
-* **Serialization of input data**: To ingest external data into Spark, data received as bytes
-  (say, from the network) needs to deserialized from bytes and re-serialized into Spark's
-  serialization format. Hence, the deserialization overhead of input data may be a bottleneck.
+* **Persisted RDDs generated by Streaming Operations**: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory as it would be processed multiple times. However, unlike Spark, by default RDDs are persisted with [StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) (i.e. serialized) to minimize GC overheads.
+
+In both cases, using Kryo serialization can reduce both CPU and memory overheads. See the [Spark Tuning Guide](tuning.html#data-serialization) for more details. Consider registering custom classes, and disabling object reference tracking for Kryo (see Kryo-related configurations in the [Configuration Guide](configuration.html#compression-and-serialization)).
+
+In specific cases where the amount of data that needs to be retained for the streaming application is not large, it may be feasible to persist data (both types) as deserialized objects without incurring excessive GC overheads. For example, if you are using batch intervals of a few seconds and no window operations, then you can try disabling serialization in persisted data by explicitly setting the storage level accordingly. This would reduce the CPU overheads due to serialization, potentially
--- End diff --
The final sentence of this paragraph is incomplete.
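
Also, since the Kryo paragraph tells readers to register custom classes and disable reference tracking, it may be worth showing the actual settings. A minimal Scala sketch (`SensorReading` is just a placeholder for a user's own record type):

```scala
import org.apache.spark.SparkConf

// Placeholder record type standing in for an application's own classes.
case class SensorReading(id: String, value: Double)

val conf = new SparkConf()
  .setAppName("KryoTuningExample")
  // Switch from the default Java serialization to Kryo.
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Streaming records rarely form cyclic object graphs, so reference
  // tracking can be disabled for a further speedup.
  .set("spark.kryo.referenceTracking", "false")

// Registering classes avoids writing full class names with every record.
conf.registerKryoClasses(Array(classOf[SensorReading]))
```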
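
Similarly, the last paragraph could show where the storage level gets set explicitly for each of the two types of data. A rough sketch, assuming a socket source on localhost:9999:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DeserializedCachingExample")
val ssc = new StreamingContext(conf, Seconds(2))

// Keep the received input as deserialized objects. Note that MEMORY_ONLY
// also drops the replication that the default MEMORY_AND_DISK_SER_2 provides.
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_ONLY)

// Persist generated RDDs as deserialized objects instead of the
// streaming default MEMORY_ONLY_SER.
val words = lines.flatMap(_.split(" "))
words.persist(StorageLevel.MEMORY_ONLY)

words.count().print()
ssc.start()
ssc.awaitTermination()
```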