GitHub user a-roberts opened a pull request:
https://github.com/apache/spark/pull/16196
[SPARK-18231] Optimise SizeEstimator implementation
## What changes were proposed in this pull request?
Several performance improvements to the SizeEstimator. Most of the benefit
comes from removing contention during estimation: threads that previously
contended with each other no longer do. Uncontended scenarios may also see a
small boost from the removal of the synchronisation code, but the cost of that
synchronisation when not truly contended is low. On the PageRank workload for
HiBench we see 10-15% performance improvements (measured as average elapsed
time) with both IBM's SDK for Java and OpenJDK 8. I don't see any changes other
than noise for the other workloads in this benchmark.
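To illustrate the contended versus uncontended distinction above, here is a minimal Java sketch. It is not the code in this pull request: the class `ContentionSketch`, its method names, the seed and the bound of 1024 are all hypothetical and chosen only for illustration. It compares a shared `java.util.Random` behind a lock with `ThreadLocalRandom`, which keeps per-thread state and needs no lock.

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration only; this is not SizeEstimator code.
final class ContentionSketch {
    // One generator shared by all threads and guarded by the class lock.
    private static final Random SHARED = new Random(42);

    // Every caller on every thread serialises on the same lock.
    static synchronized int nextIndexContended(int bound) {
        return SHARED.nextInt(bound);
    }

    // Each thread uses its own generator state, so no lock is taken.
    static int nextIndexUncontended(int bound) {
        return ThreadLocalRandom.current().nextInt(bound);
    }

    // Runs the chosen variant on `threads` threads and returns elapsed milliseconds.
    static long timeMillis(int threads, int callsPerThread, boolean contended) {
        Thread[] workers = new Thread[threads];
        long start = System.nanoTime();
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                long sink = 0;
                for (int i = 0; i < callsPerThread; i++) {
                    sink += contended ? nextIndexContended(1024) : nextIndexUncontended(1024);
                }
                // Use the result so the loop cannot be discarded as dead code.
                if (sink < 0) {
                    throw new IllegalStateException("unreachable: indices are non-negative");
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return (System.nanoTime() - start) / 1_000_000;
    }

    public static void main(String[] args) {
        int threads = Runtime.getRuntime().availableProcessors();
        System.out.println("contended:   " + timeMillis(threads, 2_000_000, true) + " ms");
        System.out.println("uncontended: " + timeMillis(threads, 2_000_000, false) + " ms");
    }
}
```

The timings depend on the JVM, core count and warm-up, so treat any numbers this prints as indicative only. With a single thread the two variants should be close, which matches the point above that uncontended synchronisation is cheap.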
## How was this patch tested?
Existing unit tests, but there are problems to resolve.
SizeEstimatorSuite and SizeTrackerSuite now fail with at least IBM Java
because smaller sizes are reported than the tests expect (let's see what
happens with OpenJDK on the community runs).
In SizeTrackerSuite I think the failures are caused by using
ThreadLocalRandom instead of Random, because with Random these tests pass
again. I'm not sure how robust SizeTrackerSuite is, though.
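As a hedged illustration of why swapping Random for ThreadLocalRandom could change what a test observes, the Java sketch below estimates a total by sampling. It is an assumption-laden toy, not the SizeEstimator or SizeTrackerSuite code: the class `SamplingSketch`, the skewed data and the sample count of 50 are all hypothetical. A seeded `Random` picks the same elements on every run, while `ThreadLocalRandom` cannot be seeded (its `setSeed` throws `UnsupportedOperationException`), so a sampled estimate can differ from run to run.

```java
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration only; this is not SizeEstimator code.
final class SamplingSketch {
    // Estimates the total of `sizes` by averaging `samples` randomly chosen
    // elements and scaling the average up to the full array length.
    static long estimateTotal(long[] sizes, int samples, Random rng) {
        long sum = 0;
        for (int i = 0; i < samples; i++) {
            sum += sizes[rng.nextInt(sizes.length)];
        }
        return (long) ((double) sum / samples * sizes.length);
    }

    // Mostly small elements with a few large ones, so the estimate is
    // sensitive to which elements happen to be sampled.
    static long[] skewedSizes() {
        long[] sizes = new long[1000];
        for (int i = 0; i < sizes.length; i++) {
            sizes[i] = (i % 100 == 0) ? 4096 : 16;
        }
        return sizes;
    }

    public static void main(String[] args) {
        long[] sizes = skewedSizes();
        // Fixed seed: both calls sample the same elements and agree.
        System.out.println("seeded:       " + estimateTotal(sizes, 50, new Random(42)));
        System.out.println("seeded again: " + estimateTotal(sizes, 50, new Random(42)));
        // ThreadLocalRandom extends Random but has no reproducible seed.
        System.out.println("thread-local: " + estimateTotal(sizes, 50, ThreadLocalRandom.current()));
    }
}
```

A test that hard-codes an expected size derived from a fixed seed would only be stable in the first case. Whether this is the actual cause of the failures reported above is not established here.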
For performance testing I've used HiBench, large profile, with one executor
sized from 10g to 25g, experimenting with fixed and dynamic heaps. The Spark
code my results are based on is from December the 1st (master branch, so a
2.1.0 snapshot).
More details on the optimisations (this being phase one and JDK agnostic)
are available at www.spark.tc/improvements-to-the-sizeestimator-class
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/a-roberts/spark patch-12
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16196.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16196
----
commit 50af8fc224cb5acb19a7b55d31ee92b44c96f26f
Author: Adam Roberts <[email protected]>
Date: 2016-12-07T10:32:37Z
[SPARK-18231] Optimise SizeEstimator implementation
Several performance improvements to the SizeEstimator. Most of the benefit
comes from removing contention during estimation: threads that previously
contended with each other no longer do. Uncontended scenarios may also see a
small boost from the removal of the synchronisation code, but the cost of that
synchronisation when not truly contended is low. On the PageRank workload for
HiBench we see durations of ~49 seconds reduced to ~41 seconds. I don't see
any changes for other workloads. Observed with both IBM's SDK for Java and
OpenJDK.
----