Github user tdas commented on a diff in the pull request:

    https://github.com/apache/spark/pull/4956#discussion_r26094413
  
    --- Diff: docs/streaming-programming-guide.md ---
    @@ -1740,16 +1834,15 @@ documentation), or set the `spark.default.parallelism`
     
     ### Data Serialization
     {:.no_toc}
    -The overhead of data serialization can be significant, especially when sub-second batch sizes are
    - to be achieved. There are two aspects to it.
    +The overheads of data serialization can be reduced by tuning the serialization formats. In the case of streaming, there are two types of data that are being serialized.
     
    -* **Serialization of RDD data in Spark**: Please refer to the detailed discussion on data
    -  serialization in the [Tuning Guide](tuning.html). However, note that unlike Spark, by default
    -  RDDs are persisted as serialized byte arrays to minimize pauses related to GC.
    +* **Input data**: By default, the input data received through Receivers is stored in the executors' memory with [StorageLevel.MEMORY_AND_DISK_SER_2](api/scala/index.html#org.apache.spark.storage.StorageLevel$). That is, the data is serialized into bytes to reduce GC overheads, and replicated for tolerating executor failures. Also, the data is kept first in memory, and spilled over to disk only if the memory is insufficient to hold all the input data necessary for the streaming computation. This serialization obviously has overheads -- the receiver must deserialize the received data and re-serialize it using Spark's serialization format.
     
    -* **Serialization of input data**: To ingest external data into Spark, data received as bytes
    -  (say, from the network) needs to deserialized from bytes and re-serialized into Spark's
    -  serialization format. Hence, the deserialization overhead of input data may be a bottleneck.
    +* **Persisted RDDs generated by Streaming Operations**: RDDs generated by streaming computations may be persisted in memory. For example, window operations persist data in memory because it will be processed multiple times. However, unlike in core Spark, these RDDs are persisted by default with [StorageLevel.MEMORY_ONLY_SER](api/scala/index.html#org.apache.spark.storage.StorageLevel$) (i.e. serialized) to minimize GC overheads. Both defaults can be overridden explicitly, as sketched after this list.
    +
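    +As a minimal sketch of setting these storage levels explicitly (the application name, host, port, and window duration are hypothetical, and the storage level passed to `socketTextStream` here is its default):
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +import org.apache.spark.storage.StorageLevel
    +import org.apache.spark.streaming.{Seconds, StreamingContext}
    +
    +val conf = new SparkConf().setAppName("StorageLevelSketch")
    +val ssc = new StreamingContext(conf, Seconds(1))
    +
    +// Receive input data with an explicit storage level (serialized, replicated).
    +val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER_2)
    +
    +// Window operations process data multiple times, so their RDDs are persisted;
    +// the storage level can be set explicitly as well.
    +val windowedLines = lines.window(Seconds(30))
    +windowedLines.persist(StorageLevel.MEMORY_ONLY_SER)
    +{% endhighlight %}
    +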
    +In both cases, using Kryo serialization can reduce both CPU and memory overheads. See the [Spark Tuning Guide](tuning.html#data-serialization) for more details. Consider registering custom classes, and disabling object reference tracking for Kryo (see Kryo-related configurations in the [Configuration Guide](configuration.html#compression-and-serialization)).
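    +
    +For example, a sketch of a Kryo-enabled configuration (the class registered below is a hypothetical placeholder for your own types):
    +
    +{% highlight scala %}
    +import org.apache.spark.SparkConf
    +
    +// Hypothetical application class that will be serialized.
    +case class MyRecord(id: Long, value: String)
    +
    +val conf = new SparkConf()
    +  .setAppName("KryoSketch")
    +  // Use Kryo instead of the default Java serialization.
    +  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    +  // Disable reference tracking if your objects have no circular references.
    +  .set("spark.kryo.referenceTracking", "false")
    +
    +// Register custom classes so Kryo writes compact class identifiers.
    +conf.registerKryoClasses(Array(classOf[MyRecord]))
    +{% endhighlight %}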
    +
    +In specific cases where the amount of data that needs to be retained for the streaming application is not large, it may be feasible to persist data (both types) as deserialized objects without incurring excessive GC overheads. For example, if you are using batch intervals of a few seconds and no window operations, then you can try disabling serialization in persisted data by explicitly setting the storage level accordingly. This would reduce the CPU overheads due to serialization, potentially improving performance without too much GC overhead.
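    +
    +A sketch of what disabling serialization could look like, assuming a DStream of results named `wordCounts` (hypothetical):
    +
    +{% highlight scala %}
    +import org.apache.spark.storage.StorageLevel
    +
    +// With small batches and no window operations, keeping the data deserialized
    +// avoids serialization CPU costs, at the risk of more GC pressure.
    +wordCounts.persist(StorageLevel.MEMORY_ONLY)
    +{% endhighlight %}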
    --- End diff --
    
    Done.

