GitHub user witgo opened a pull request:
https://github.com/apache/spark/pull/17116
[SPARK-18890][CORE](try 2) Move task serialization from the TaskSetManager
to the CoarseGrainedSchedulerBackend
## What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-18890
When a stage has a large number of tasks, this PR can improve scheduling
performance by ~~15%~~.
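As a rough illustration of the idea in the title, the toy sketch below defers serialization from the component that picks a task to the component that launches it. Every name in it (`ToyTask`, `ToyScheduler`, `ToyBackend`, `launchNext`) is hypothetical and none of it is Spark's code. That the scheduler-side call runs under a lock is an assumption made only for the sake of the example.

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}
import java.nio.ByteBuffer

object SerializationPlacementSketch {
  // Hypothetical stand-in for a task; not Spark's Task class.
  final case class ToyTask(id: Int, payload: String)

  // Java serialization keeps the sketch free of external dependencies.
  def serialize(task: ToyTask): ByteBuffer = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    try out.writeObject(task) finally out.close()
    ByteBuffer.wrap(bytes.toByteArray)
  }

  // Hypothetical scheduler: picks the next task under a lock (an assumption),
  // but returns it unserialized.
  final class ToyScheduler(tasks: Seq[ToyTask]) {
    private var pending = tasks.toList
    def resourceOffer(): Option[ToyTask] = synchronized {
      pending match {
        case head :: tail =>
          pending = tail
          Some(head)
        case Nil => None
      }
    }
  }

  // Hypothetical backend: serializes only after resourceOffer() has returned,
  // that is, after the scheduler lock has been released.
  final class ToyBackend(scheduler: ToyScheduler) {
    def launchNext(): Option[(Int, ByteBuffer)] =
      scheduler.resourceOffer().map(task => task.id -> serialize(task))
  }

  def main(args: Array[String]): Unit = {
    val backend = new ToyBackend(new ToyScheduler(Seq(ToyTask(0, "a"), ToyTask(1, "b"))))
    var next = backend.launchNext()
    while (next.isDefined) {
      next.foreach { case (id, buf) => println(s"task $id serialized to ${buf.remaining()} bytes") }
      next = backend.launchNext()
    }
  }
}
```

In this arrangement the time spent in `serialize` is paid by the caller of `launchNext`, not inside the synchronized block.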
The test code:
``` scala
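// Run in spark-shell, where `sc` is the provided SparkContext.
// 100 elements spread over 100,000 partitions, so each task does very little work.
val rdd = sc.parallelize(0 until 100).repartition(100000)
// Mark the RDD for local checkpointing and run a job to materialize it.
rdd.localCheckpoint().count()
// Warm-up run, not timed.
rdd.sum()
(1 to 10).foreach { i =>
  val start = System.currentTimeMillis()
  rdd.sum()
  val finish = System.currentTimeMillis()
  // Elapsed wall-clock time of the whole job, in seconds.
  println(f"Test $i: ${(finish - start) / 1000D}%1.2f")
}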
```
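The timing pattern in the test code can be factored into a small helper. This is a minimal sketch, not part of the PR: `timed` and `TimingSketch` are hypothetical names, and the workload is a local stand-in so that the block runs without a SparkContext. It uses `System.nanoTime`, which the JDK documents as the call intended for measuring elapsed time.

```scala
object TimingSketch {
  // Hypothetical helper, not part of the PR: runs `body` once and returns
  // its result together with the elapsed time in seconds.
  def timed[A](body: => A): (A, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    // Stand-in workload; in the benchmark this would be rdd.sum().
    (1 to 3).foreach { i =>
      val (total, seconds) = timed((0 until 100).map(_.toDouble).sum)
      println(f"Test $i: $seconds%1.4f s (sum = $total%1.1f)")
    }
  }
}
```

In spark-shell the same helper could wrap the job directly, for example `timed(rdd.sum())`.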
and the `spark-defaults.conf` file:
```
spark.master yarn-client
spark.executor.instances 20
spark.driver.memory 64g
spark.executor.memory 30g
spark.executor.cores 5
spark.default.parallelism 100
spark.sql.shuffle.partitions 100
spark.serializer org.apache.spark.serializer.KryoSerializer
spark.driver.maxResultSize 0
spark.ui.enabled false
spark.driver.extraJavaOptions -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=512M
spark.executor.extraJavaOptions -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
spark.cleaner.referenceTracking.blocking true
spark.cleaner.referenceTracking.blocking.shuffle true
```
The test results are as follows:
**The table is out of date, to be updated**
| [SPARK-17931](https://github.com/witgo/spark/tree/SPARK-17931) | [941b3f9](https://github.com/apache/spark/commit/941b3f9aca59e62c078508a934f8c2221ced96ce) |
| --- | --- |
| 17.116 s | 21.764 s |
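For reference, the relative difference between the two timings in the table can be worked out as below. This is only a sketch: it assumes the first column is the patched branch and the second is the baseline commit, the names are hypothetical, and the table itself is marked as out of date, so the figure (roughly 21%) should not be read as the PR's final result.

```scala
object SpeedupSketch {
  // Timings copied from the table; which column is the baseline is an assumption.
  val patchedSeconds = 17.116
  val baselineSeconds = 21.764

  // Percentage of the baseline runtime saved by the patched build.
  def reductionPercent(baseline: Double, patched: Double): Double =
    (baseline - patched) / baseline * 100.0

  def main(args: Array[String]): Unit =
    println(f"runtime reduction: ${reductionPercent(baselineSeconds, patchedSeconds)}%1.1f%%")
}
```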
## How was this patch tested?
Existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/witgo/spark SPARK-18890
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17116.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17116
----
commit 84550a73b3f786b2ac569500a95ac38dc6f44657
Author: Guoqiang Li <[email protected]>
Date: 2017-01-08T11:18:59Z
Move task serialization from the TaskSetManager to the
CoarseGrainedSchedulerBackend
commit 39aa22e8b95a25cd250eeafeb5ec800dfa794896
Author: Guoqiang Li <[email protected]>
Date: 2017-01-11T06:05:53Z
review commits
commit d562727dfd554c78ea17e590841a1e74b9b4f9aa
Author: Guoqiang Li <[email protected]>
Date: 2017-01-13T02:10:03Z
add test "Scheduler aborts stages that have unserializable partition"
commit fc6789e027e8ea935ad392cfca90dd318a6d9e57
Author: Imran Rashid <[email protected]>
Date: 2017-01-13T21:42:44Z
refactor
commit 1edcf2a7e65e7c9373782824a71ec87909e88097
Author: Guoqiang Li <[email protected]>
Date: 2017-01-16T01:10:44Z
create all the serialized tasks to make sure they all work
commit 79dda74ab27eb9a2630921816305e489aef4f72e
Author: Guoqiang Li <[email protected]>
Date: 2017-01-22T03:28:04Z
review commits
commit 0b20da4c0f79d89b0689e19c6b5e3fcdf8b360fb
Author: Guoqiang Li <[email protected]>
Date: 2017-01-25T01:09:20Z
add lock on the scheduler object
commit b5de21f510697d42a9c7f7f255d20a41641a5122
Author: Kay Ousterhout <[email protected]>
Date: 2017-02-07T00:38:48Z
Consolidate TaskDescription constructors.
This commit also does all task serialization in the encode() method,
so now the encode() method just takes the TaskDescription as an
input parameter.
commit 819a88cb74a41c446a442ff91fb14f1093025f77
Author: Guoqiang Li <[email protected]>
Date: 2017-02-07T15:21:11Z
Refactor the taskDesc serialization code
commit 900884b82f67d2f51f73aff49b671b2dcb450264
Author: Guoqiang Li <[email protected]>
Date: 2017-02-09T16:58:15Z
Add ut: serialization task errors do not affect each other
commit 0812fc9067b9a0652de64e5c0539eaed9d8f243d
Author: Guoqiang Li <[email protected]>
Date: 2017-02-25T17:17:03Z
askWithRetry => askSync
commit 0dae93a1edd2644cb63656ae56d41adaa59cf5e3
Author: Guoqiang Li <[email protected]>
Date: 2017-02-27T10:19:13Z
fix the import ordering in TaskDescription.scala
commit 9dab121247cf8b30912f016c404279acd0b42f41
Author: Guoqiang Li <[email protected]>
Date: 2017-03-01T08:06:39Z
review commits
commit b67bdaf52e44b320c565bef79fd2dac6904620ae
Author: Guoqiang Li <[email protected]>
Date: 2017-03-01T08:51:59Z
move prepareSerializedTask to TaskSchedulerImpl.scala
----