[GitHub] spark pull request #17329: [SPARK-19991]FileSegmentManagedBuffer performance...

witgo Thu, 16 Mar 2017 20:29:47 -0700

GitHub user witgo opened a pull request:

    https://github.com/apache/spark/pull/17329


    [SPARK-19991]FileSegmentManagedBuffer performance improvement

    FileSegmentManagedBuffer performance improvement.
    
    
    ## What changes were proposed in this pull request?
    
    When the configuration items `spark.storage.memoryMapThreshold` and
    `spark.shuffle.io.lazyFD` are not set, each call to
    `FileSegmentManagedBuffer.nioByteBuffer` or
    `FileSegmentManagedBuffer.createInputStream` creates a
    `NoSuchElementException` instance, which is a relatively expensive operation.
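    
    The cost described above can be illustrated with a minimal sketch. It is not
    the code changed by this PR: `SketchConfigProvider`, `ThrowingProvider`, and
    `DirectProvider` are hypothetical names, and the default values passed in are
    placeholders. It only contrasts a default lookup that relies on catching an
    exception with one that checks the settings directly.
    
    ```scala
    // Hypothetical stand-in for a config provider; not Spark's actual class.
    abstract class SketchConfigProvider {
      /** Returns the value for `name`, or throws NoSuchElementException if unset. */
      def get(name: String): String
    
      /** Default lookup: relies on the exception to detect a missing key. */
      def get(name: String, defaultValue: String): String =
        try get(name)
        catch { case _: NoSuchElementException => defaultValue }
    }
    
    // Exception-based variant: each lookup of an unset key allocates an exception.
    class ThrowingProvider(settings: Map[String, String]) extends SketchConfigProvider {
      override def get(name: String): String =
        settings.getOrElse(name, throw new NoSuchElementException(name))
    }
    
    // Variant that overrides the two-argument lookup so no exception is created
    // when the key is unset.
    class DirectProvider(settings: Map[String, String]) extends SketchConfigProvider {
      override def get(name: String): String =
        settings.getOrElse(name, throw new NoSuchElementException(name))
    
      override def get(name: String, defaultValue: String): String =
        settings.getOrElse(name, defaultValue)
    }
    ```
    
    Both variants return the same values. The difference is that `DirectProvider`
    avoids constructing an exception (and filling in its stack trace) on every
    lookup of an unset key, which matters when the lookup sits on a hot path.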
    
    In the test case below, this PR improves performance by about 3.5%.
    
    The test code:
    
    ``` scala
    (1 to 10).foreach { i =>
      val numPartition = 10000
      val rdd = sc.parallelize(0 until numPartition).repartition(numPartition).flatMap { t =>
        (0 until numPartition).map(r => r * numPartition + t)
      }.repartition(numPartition)
      val serializeStart = System.currentTimeMillis()
      rdd.sum()
      val serializeFinish = System.currentTimeMillis()
      println(f"Test $i: ${(serializeFinish - serializeStart) / 1000D}%1.2f")
    }
    ```
    
    The `spark-defaults.conf` file used for the test:
    
    ```
    spark.master                                      yarn-client
    spark.executor.instances                          20
    spark.driver.memory                               64g
    spark.executor.memory                             30g
    spark.executor.cores                              5
    spark.default.parallelism                         100 
    spark.sql.shuffle.partitions                      100
    spark.serializer                                  org.apache.spark.serializer.KryoSerializer
    spark.driver.maxResultSize                        0
    spark.ui.enabled                                  false
    spark.driver.extraJavaOptions                     -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=512M
    spark.executor.extraJavaOptions                   -XX:+UseG1GC -XX:+UseStringDeduplication -XX:G1HeapRegionSize=16M -XX:MetaspaceSize=256M
    spark.cleaner.referenceTracking.blocking          true
    spark.cleaner.referenceTracking.blocking.shuffle  true
    
    ```
    
    The test results are as follows:
    
    | [SPARK-19991](https://github.com/witgo/spark/tree/SPARK-19991) | https://github.com/apache/spark/commit/68ea290b3aa89b2a539d13ea2c18bdb5a651b2bf |
    | --- | --- |
    | 226.09 s | 235.21 s |
    
    ## How was this patch tested?
    
    Existing tests.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/witgo/spark SPARK-19991

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17329.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17329
    
----
commit abcfc79991ecd1d5cef2cd1e275b872695ba19d9
Author: Guoqiang Li <[email protected]>
Date:   2017-03-17T03:19:37Z

    FileSegmentManagedBuffer performance improvement

----


