[GitHub] spark pull request #16196: [SPARK-18231] Optimise SizeEstimator implementati...

a-roberts Wed, 07 Dec 2016 02:43:00 -0800

GitHub user a-roberts opened a pull request:

    https://github.com/apache/spark/pull/16196


    [SPARK-18231] Optimise SizeEstimator implementation

    ## What changes were proposed in this pull request?
    
    Several performance improvements to the SizeEstimator. Most of the 
benefit comes from estimation no longer contending across multiple threads. 
Removing the synchronisation code can also give a small boost in uncontended 
scenarios, but the cost of that synchronisation is low when it is not truly 
contended. On the PageRank workload for HiBench we see 10-15% performance 
improvements (measured as average elapsed time) with both IBM's SDK for Java 
and OpenJDK 8. I don't see any changes other than noise for the other 
workloads on this benchmark.
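
    To illustrate the contended versus uncontended distinction described 
above, here is a minimal Java sketch. It is not the SizeEstimator code: the 
class ContentionDemo and both method names are hypothetical, and the counter 
stands in for whatever shared state the estimator guards. Both methods compute 
the same total; only the first makes every thread queue on one lock.

```java
// Hypothetical illustration only; none of these names exist in Spark.
class ContentionDemo {
    private static final Object LOCK = new Object();
    private static long sharedTotal = 0;

    // Contended variant: every increment from every thread takes the same lock.
    static long sumWithSharedLock(int threads, int perThread) {
        sharedTotal = 0;
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    synchronized (LOCK) {
                        sharedTotal++;
                    }
                }
            });
            workers[t].start();
        }
        joinAll(workers);
        return sharedTotal;
    }

    // Uncontended variant: each thread works on private state and publishes
    // its result once, so no lock is needed inside the hot loop.
    static long sumWithPerThreadState(int threads, int perThread) {
        long[] partial = new long[threads];
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int slot = t;
            workers[t] = new Thread(() -> {
                long local = 0;
                for (int i = 0; i < perThread; i++) {
                    local++;
                }
                partial[slot] = local; // one write per thread, to its own slot
            });
            workers[t].start();
        }
        joinAll(workers);
        long total = 0;
        for (long p : partial) {
            total += p;
        }
        return total;
    }

    // Waits for all workers; join() also makes their writes visible here.
    private static void joinAll(Thread[] workers) {
        for (Thread w : workers) {
            try {
                w.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted while joining", e);
            }
        }
    }
}
```

    Timing the two methods with several threads is one way to see the effect 
on a given JVM; the size of the gap will depend on the thread count and the 
hardware.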
    
    ## How was this patch tested?
    
    With the existing unit tests, but there are problems to resolve.
    
    I see SizeEstimatorSuite and SizeTrackerSuite failing with at least IBM 
Java now due to smaller sizes being reported than the test expects (let's see 
what happens with OpenJDK on the community runs). 
    
    In SizeTrackerSuite I think the failures are caused by using 
ThreadLocalRandom and not Random - because with Random we see these tests 
passing again. Not sure how robust SizeTrackerSuite is though.
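
    One possible reason the choice of generator matters to a test is 
determinism: a java.util.Random built with a fixed seed replays the same 
sequence on every run, whereas ThreadLocalRandom cannot be seeded at all. The 
Java sketch below shows only that difference. The class SamplingDemo and its 
methods are hypothetical, and whether this is what actually breaks 
SizeTrackerSuite is an assumption, not something confirmed here.

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration only; not taken from Spark.
class SamplingDemo {
    // Draws `count` indices in [0, bound) from a Random with a fixed seed.
    // The same seed always yields the same indices.
    static int[] seededIndices(long seed, int count, int bound) {
        Random rand = new Random(seed);
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = rand.nextInt(bound);
        }
        return out;
    }

    // Draws from the calling thread's own generator. There is no shared
    // state to contend on, but the sequence differs from run to run.
    static int[] threadLocalIndices(int count, int bound) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = ThreadLocalRandom.current().nextInt(bound);
        }
        return out;
    }

    // ThreadLocalRandom.setSeed is documented to throw
    // UnsupportedOperationException, so a fixed seed is not an option.
    static boolean threadLocalRejectsSeed() {
        try {
            ThreadLocalRandom.current().setSeed(42L);
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }
}
```

    If a test's expected sizes were derived from one particular sampled 
sequence, swapping the generator would change which elements are sampled and 
could shift the estimate.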
    
    For performance testing I've used HiBench, large profile, with one executor 
whose memory ranged from 10g to 25g, experimenting with fixed and dynamic 
heaps. The Spark code I've based my results on is from 1 December (master 
branch, so a 2.1.0 snapshot).
    
    More details on the optimisations (this being phase one and JDK agnostic) 
at www.spark.tc/improvements-to-the-sizeestimator-class

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/a-roberts/spark patch-12

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16196.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16196
    
----
commit 50af8fc224cb5acb19a7b55d31ee92b44c96f26f
Author: Adam Roberts <[email protected]>
Date:   2016-12-07T10:32:37Z

    [SPARK-18231] Optimise SizeEstimator implementation
    
    Several performance improvements to the SizeEstimator. Most of the 
benefit comes from estimation no longer contending across multiple threads. 
Removing the synchronisation code can also give a small boost in uncontended 
scenarios, but the cost of that synchronisation is low when it is not truly 
contended. On the PageRank workload for HiBench we see durations of ~49 
seconds reduced to ~41 seconds. I don't see any changes for other workloads. 
Observed with both IBM's SDK for Java and OpenJDK.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
