GitHub user ConeyLiu opened a pull request:
https://github.com/apache/spark/pull/19586
[SPARK-22367][CORE] Separate the serialization of class and object for
iterator
## What changes were proposed in this pull request?
All records in an iterator are of the same class, so there is no need to
write the class information for every record. We only need to write the class
information once, at the start of serialization, and read it once, at the
start of deserialization.
In this patch, we separate the serialization of the class from that of the
object for an iterator serialized by Kryo. This improves serialization and
deserialization performance and reduces the size of the serialized data.
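The idea can be illustrated without Kryo or Spark. The following is only a minimal sketch using `java.io` streams, not the code in this patch: the `Person` fields, the `ClassOnceFraming` object, and the boolean "has next" marker are all hypothetical choices made for the illustration. It contrasts writing the class name before every record with writing it once up front.

```scala
import java.io.{ByteArrayInputStream, ByteArrayOutputStream, DataInputStream, DataOutputStream}

// Hypothetical record type standing in for the elements of the iterator.
final case class Person(id: String, age: Int)

// Illustration only: plain java.io framing, not Kryo and not Spark's serializer.
object ClassOnceFraming {
  private val className: String = classOf[Person].getName

  private def writeFields(out: DataOutputStream, p: Person): Unit = {
    out.writeUTF(p.id)
    out.writeInt(p.age)
  }

  // Per-record framing: the class name is repeated before every record.
  def writePerRecord(records: Iterator[Person]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    records.foreach { p =>
      out.writeBoolean(true) // "has next" marker
      out.writeUTF(className)
      writeFields(out, p)
    }
    out.writeBoolean(false) // end of stream
    out.flush()
    bytes.toByteArray
  }

  // Class-once framing: the class name is written a single time at the start.
  def writeClassOnce(records: Iterator[Person]): Array[Byte] = {
    val bytes = new ByteArrayOutputStream()
    val out = new DataOutputStream(bytes)
    out.writeUTF(className)
    records.foreach { p =>
      out.writeBoolean(true)
      writeFields(out, p)
    }
    out.writeBoolean(false)
    out.flush()
    bytes.toByteArray
  }

  // Reads the class name once, then decodes every record as that class.
  def readClassOnce(data: Array[Byte]): List[Person] = {
    val in = new DataInputStream(new ByteArrayInputStream(data))
    val name = in.readUTF()
    require(name == className, s"unexpected class: $name")
    val result = List.newBuilder[Person]
    while (in.readBoolean()) {
      result += Person(in.readUTF(), in.readInt())
    }
    result.result()
  }
}
```

With this framing, the saving per record is roughly the encoded length of the class name, so the relative gain is largest when the records themselves are small.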
Test case:
```scala
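import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel
import scala.util.Random

// Assumed record type; the original snippet does not show its definition.
case class Person(id: String, age: Int)

val conf = new SparkConf().setAppName("Test for serialization")
val sc = new SparkContext(conf)
val random = new Random(1)
val data = sc.parallelize(1 to 1000000000).map { i =>
  Person("id-" + i, random.nextInt(Integer.MAX_VALUE))
}.persist(StorageLevel.OFF_HEAP)

// First count: computes the RDD and serializes it into the off-heap cache.
var start = System.currentTimeMillis()
data.count()
println("First time: " + (System.currentTimeMillis() - start))

// Second count: reads the cached data back, so it measures deserialization.
start = System.currentTimeMillis()
data.count()
println("Second time: " + (System.currentTimeMillis() - start))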
```
Test result:
Size of the serialized data:
before: 34.3GB
after: 17.5GB
| before (computation + serialization, ms) | before (deserialization, ms) | after (computation + serialization, ms) | after (deserialization, ms) |
| ------| ------ | ------ | ------ |
| 63869| 21882| 45513| 15158|
| 59368| 21507| 51683| 15524|
| 66230| 21481| 62163| 14903|
| 62399| 22529| 52400| 16255|
| 137564.2 | 136990.8 | 1.004186 |
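To compare the four complete runs in the table above at a glance, the per-column means and the after/before ratios can be computed as below. This is just a convenience sketch: the `TimingSummary` object and its field names are made up for illustration, and the numbers are copied from the four-column rows of the table (milliseconds, as measured with `System.currentTimeMillis()`).

```scala
object TimingSummary {
  // One tuple per run, in milliseconds:
  // (before compute+ser, before deser, after compute+ser, after deser)
  val runs: Seq[(Long, Long, Long, Long)] = Seq(
    (63869L, 21882L, 45513L, 15158L),
    (59368L, 21507L, 51683L, 15524L),
    (66230L, 21481L, 62163L, 14903L),
    (62399L, 22529L, 52400L, 16255L)
  )

  def mean(xs: Seq[Long]): Double = xs.sum.toDouble / xs.size

  val beforeSer: Double = mean(runs.map(_._1))
  val beforeDeser: Double = mean(runs.map(_._2))
  val afterSer: Double = mean(runs.map(_._3))
  val afterDeser: Double = mean(runs.map(_._4))

  def main(args: Array[String]): Unit = {
    println(f"compute + serialization: $beforeSer%.2f ms -> $afterSer%.2f ms (ratio ${afterSer / beforeSer}%.3f)")
    println(f"deserialization:         $beforeDeser%.2f ms -> $afterDeser%.2f ms (ratio ${afterDeser / beforeDeser}%.3f)")
  }
}
```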
## How was this patch tested?
Existing unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/ConeyLiu/spark kryo
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/19586.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #19586
----
commit c681e81f9d49b3558c91a3b981504159bbeff910
Author: Xianyang Liu <[email protected]>
Date: 2017-10-26T06:37:04Z
serialize object and class seperately for iterator
commit 640ad5e1d12d1137f4c979a1e75dbdbd713e14de
Author: Xianyang Liu <[email protected]>
Date: 2017-10-26T06:42:58Z
Merge remote-tracking branch 'spark/master' into kryo
----