GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/17116

    [SPARK-18890][CORE](try 2) Move task serialization from the TaskSetManager to the CoarseGrainedSchedulerBackend

    ## What changes were proposed in this pull request?
    
    See https://issues.apache.org/jira/browse/SPARK-18890
    
    When a stage has a large number of tasks, this PR can improve scheduling performance by ~~15%~~.
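    
    To illustrate the idea behind the move, here is a minimal toy sketch, not Spark code. It assumes the benefit comes from keeping serialization out of the code path that runs under the scheduler's lock, so that only cheap bookkeeping happens there. `ToyTask`, `ToyTaskDescription`, `ToyScheduler`, and `ToyBackend` are hypothetical names invented for this example, and the wire format is arbitrary.
    
    ```scala
    import java.io.{ByteArrayOutputStream, DataOutputStream}
    import java.nio.ByteBuffer
    
    // Hypothetical stand-ins for illustration only; none of these are Spark classes.
    final case class ToyTask(partitionId: Int, payload: String)
    final case class ToyTaskDescription(taskId: Long, executorId: String, task: ToyTask)
    
    object ToyScheduler {
      private val lock = new Object
      private var nextTaskId = 0L
    
      // Only cheap bookkeeping happens while the lock is held; nothing is serialized here.
      def resourceOffer(executorId: String, task: ToyTask): ToyTaskDescription =
        lock.synchronized {
          val id = nextTaskId
          nextTaskId += 1
          ToyTaskDescription(id, executorId, task)
        }
    }
    
    object ToyBackend {
      // Serialization is deferred until launch time, outside the scheduler lock.
      def encode(desc: ToyTaskDescription): ByteBuffer = {
        val bytes = new ByteArrayOutputStream()
        val out = new DataOutputStream(bytes)
        try {
          out.writeLong(desc.taskId)
          out.writeUTF(desc.executorId)
          out.writeInt(desc.task.partitionId)
          out.writeUTF(desc.task.payload)
        } finally {
          out.close()
        }
        ByteBuffer.wrap(bytes.toByteArray)
      }
    }
    ```
    
    In this toy model the scheduler hands back an unserialized description, and the backend calls `ToyBackend.encode(desc)` just before sending it to an executor.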
    
    The test code:
    
    ``` scala
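    // Assumes `sc` is an existing SparkContext (for example, the one spark-shell provides).
    // 100000 partitions means 100000 tasks per job, which stresses task scheduling.
    val rdd = sc.parallelize(0 until 100).repartition(100000)
    rdd.localCheckpoint().count()
    rdd.sum() // untimed run before the measured loop
    (1 to 10).foreach { i =>
      val serializeStart = System.currentTimeMillis()
      rdd.sum()
      val serializeFinish = System.currentTimeMillis()
      // Elapsed wall-clock time for one job, in seconds
      println(f"Test $i: ${(serializeFinish - serializeStart) / 1000D}%1.2f")
    }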
    ```
    
    and the `spark-defaults.conf` file:
    
    ```
    spark.master                                      yarn-client
    spark.executor.instances                          20
    spark.driver.memory                               64g
    spark.executor.memory                             30g
    spark.executor.cores                              5
    spark.default.parallelism                         100 
    spark.sql.shuffle.partitions                      100
    spark.serializer                                  org.apache.spark.serializer.KryoSerializer
    spark.driver.maxResultSize                        0
    spark.ui.enabled                                  false
    spark.driver.extraJavaOptions                     -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=512M
    spark.executor.extraJavaOptions                   -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
    spark.cleaner.referenceTracking.blocking          true
    spark.cleaner.referenceTracking.blocking.shuffle  true
    
    ```
    
    The test results are as follows:
    
    **The table is out of date, to be updated**
    
    | [SPARK-17931](https://github.com/witgo/spark/tree/SPARK-17931) | [941b3f9](https://github.com/apache/spark/commit/941b3f9aca59e62c078508a934f8c2221ced96ce) |
    | --- | --- |
    | 17.116 s | 21.764 s |
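    
    For reference, the relative difference between the two timings in the table can be worked out as in the sketch below. It assumes the first column is the patched branch and the second is the baseline commit, which the table does not state explicitly, and it uses the figures from a table that is flagged as out of date, so the result is illustrative only. `SpeedupCheck` is a name made up for this example.
    
    ```scala
    object SpeedupCheck {
      // Timings copied from the table (seconds). Assumption: first column = patched, second = baseline.
      val patchedSeconds = 17.116
      val baselineSeconds = 21.764
    
      // Fraction of wall-clock time saved relative to the baseline.
      val reduction: Double = (baselineSeconds - patchedSeconds) / baselineSeconds
      // How many times faster the patched run is.
      val speedup: Double = baselineSeconds / patchedSeconds
    
      def main(args: Array[String]): Unit = {
        println(f"Wall-clock reduction: ${reduction * 100}%.1f%%") // prints "Wall-clock reduction: 21.4%"
        println(f"Speedup factor: $speedup%.2fx")                  // prints "Speedup factor: 1.27x"
      }
    }
    ```
    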
    ## How was this patch tested?
    
    Existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark SPARK-18890

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17116.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17116
    
----
commit 84550a73b3f786b2ac569500a95ac38dc6f44657
Author: Guoqiang Li <[email protected]>
Date:   2017-01-08T11:18:59Z

    Move task serialization from the TaskSetManager to the CoarseGrainedSchedulerBackend

commit 39aa22e8b95a25cd250eeafeb5ec800dfa794896
Author: Guoqiang Li <[email protected]>
Date:   2017-01-11T06:05:53Z

    review commits

commit d562727dfd554c78ea17e590841a1e74b9b4f9aa
Author: Guoqiang Li <[email protected]>
Date:   2017-01-13T02:10:03Z

    add test "Scheduler aborts stages that have unserializable partition"

commit fc6789e027e8ea935ad392cfca90dd318a6d9e57
Author: Imran Rashid <[email protected]>
Date:   2017-01-13T21:42:44Z

    refactor

commit 1edcf2a7e65e7c9373782824a71ec87909e88097
Author: Guoqiang Li <[email protected]>
Date:   2017-01-16T01:10:44Z

    create all the serialized tasks to make sure they all work

commit 79dda74ab27eb9a2630921816305e489aef4f72e
Author: Guoqiang Li <[email protected]>
Date:   2017-01-22T03:28:04Z

    review commits

commit 0b20da4c0f79d89b0689e19c6b5e3fcdf8b360fb
Author: Guoqiang Li <[email protected]>
Date:   2017-01-25T01:09:20Z

    add lock on the scheduler object

commit b5de21f510697d42a9c7f7f255d20a41641a5122
Author: Kay Ousterhout <[email protected]>
Date:   2017-02-07T00:38:48Z

    Consolidate TaskDescription constructors.
    
    This commit also does all task serialization in the encode() method,
    so now the encode() method just takes the TaskDescription as an
    input parameter.

commit 819a88cb74a41c446a442ff91fb14f1093025f77
Author: Guoqiang Li <[email protected]>
Date:   2017-02-07T15:21:11Z

    Refactor the taskDesc serialization code

commit 900884b82f67d2f51f73aff49b671b2dcb450264
Author: Guoqiang Li <[email protected]>
Date:   2017-02-09T16:58:15Z

    Add ut: serialization task errors do not affect each other

commit 0812fc9067b9a0652de64e5c0539eaed9d8f243d
Author: Guoqiang Li <[email protected]>
Date:   2017-02-25T17:17:03Z

    askWithRetry => askSync

commit 0dae93a1edd2644cb63656ae56d41adaa59cf5e3
Author: Guoqiang Li <[email protected]>
Date:   2017-02-27T10:19:13Z

    fix the import ordering in TaskDescription.scala

commit 9dab121247cf8b30912f016c404279acd0b42f41
Author: Guoqiang Li <[email protected]>
Date:   2017-03-01T08:06:39Z

    review commits

commit b67bdaf52e44b320c565bef79fd2dac6904620ae
Author: Guoqiang Li <[email protected]>
Date:   2017-03-01T08:51:59Z

    move prepareSerializedTask to TaskSchedulerImpl.scala

----


