GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/15743
[SPARK-18236] Reduce duplicate objects in Spark UI and HistoryServer
## What changes were proposed in this pull request?
When profiling heap dumps from the HistoryServer and live Spark web UIs, I
found a large amount of memory being wasted on duplicated objects and strings.
This patch removes most of that duplication, yielding over 40% memory savings
in some benchmarks.
- **Task metrics** (6441f0624dfcda9c7193a64bfb416a145b5aabdf): previously,
every `TaskUIData` object had its own instances of `InputMetricsUIData`,
`OutputMetricsUIData`, `ShuffleReadMetrics`, and `ShuffleWriteMetrics`, even
though for many tasks these metrics are irrelevant because they're all zero.
This patch changes how these metrics are constructed so that a single
immutable "empty" instance is re-used for all of the empty cases (see the
first sketch after this list).
- **TaskInfo.accumulables** (ade86db901127bf13c0e0bdc3f09c933a093bb76):
previously, every `TaskInfo` object had its own empty `ListBuffer` for holding
updates from named accumulators, so even tasks that never used named
accumulators paid the cost of allocating and storing that buffer. To avoid
this overhead, I changed the `val` holding a mutable buffer into a `var`
holding an immutable Scala list, which lets all tasks without named
accumulator updates share the same singleton `Nil` object (see the second
sketch after this list).
- **String.intern() in JSONProtocol**
(7e05630e9a78c455db8c8c499f0590c864624e05): in the HistoryServer, executor
hostnames and IDs are deserialized from JSON, leading to massive duplication
of these string objects. Calling `String.intern()` on the deserialized values
removes all of this duplication (see the third sketch after this list). Since
Spark now requires Java 7+, interned strings are stored on the main heap
rather than in permgen, so we don't have to worry about interning exhausting
permgen space (see http://java-performance.info/string-intern-in-java-6-7-8/).
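To make the first change concrete, here's a minimal, self-contained sketch of
the empty-singleton pattern; the class shape and the `of` factory are
simplified stand-ins rather than the exact Spark code:

```
// Simplified stand-in for one of the UI metrics classes (hypothetical shape).
case class InputMetricsUIData(bytesRead: Long, recordsRead: Long)

object InputMetricsUIData {
  // One shared immutable instance covering the common all-zero case.
  private val EMPTY = new InputMetricsUIData(0L, 0L)

  // Hypothetical factory: hands out the shared EMPTY value instead of
  // allocating a fresh all-zero object for every task.
  def of(bytesRead: Long, recordsRead: Long): InputMetricsUIData =
    if (bytesRead == 0L && recordsRead == 0L) EMPTY
    else InputMetricsUIData(bytesRead, recordsRead)
}

object EmptyMetricsDemo extends App {
  val a = InputMetricsUIData.of(0L, 0L)
  val b = InputMetricsUIData.of(0L, 0L)
  assert(a eq b) // all-zero metrics resolve to one shared object
}
```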
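The second change is the same idea applied to `TaskInfo`; here's a
before/after sketch with the element type simplified to `String` (the real
field holds accumulator-update info, and `recordAccumulable` is a hypothetical
helper):

```
import scala.collection.mutable.ListBuffer

// Before: every task allocated its own mutable buffer, even when it
// never recorded a named accumulator update.
class TaskInfoBefore {
  val accumulables = ListBuffer.empty[String]
}

// After: a var holding an immutable list. Tasks with no updates all
// share the one global Nil object instead of owning empty buffers.
class TaskInfoAfter {
  var accumulables: List[String] = Nil

  // Hypothetical helper: rebinds the var to a new immutable list.
  def recordAccumulable(update: String): Unit = {
    accumulables = update :: accumulables
  }
}
```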
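And a small demonstration of why `String.intern()` deduplicates the strings
parsed out of event logs; `new String(...)` stands in for JSON
deserialization, which likewise produces a distinct object for every equal
value:

```
object InternDemo extends App {
  // Simulate deserialization: new String always yields a fresh object,
  // just as each parsed event-log line does.
  val a = new String("host-1.example.com")
  val b = new String("host-1.example.com")

  assert(a == b)                   // equal contents...
  assert(!(a eq b))                // ...but two distinct heap objects
  assert(a.intern() eq b.intern()) // interning collapses them into one
}
```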
## How was this patch tested?
I ran
```
sc.parallelize(1 to 100000, 100000).count()
```
in `spark-shell` with event logging enabled, then loaded that event log in
the HistoryServer, performed a full GC, and took a heap dump. According to
YourKit, the changes in this patch reduced memory consumption by roughly 28
megabytes (about 770,000 fewer Java objects).

[Table: drop in per-object counts due to deduplication. The drop is under
100k for some objects because some events were dropped from the listener bus;
this is a separate, pre-existing bug that I'll address after CPU profiling.]
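
For anyone reproducing the measurement, event logging can be enabled from the
`spark-shell` command line; the log directory below is a placeholder:

```
$ ./bin/spark-shell \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=/tmp/spark-events
```

The HistoryServer can then be pointed at the same directory (via
`spark.history.fs.logDirectory`) and started with
`./sbin/start-history-server.sh`.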

You can merge this pull request into a Git repository by running:

```
$ git pull https://github.com/JoshRosen/spark spark-ui-memory-usage
```
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15743.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15743
----
commit 6441f0624dfcda9c7193a64bfb416a145b5aabdf
Author: Josh Rosen <[email protected]>
Date: 2016-11-02T21:00:16Z
Re-use same instance for empty metrics UI data objects.
commit ade86db901127bf13c0e0bdc3f09c933a093bb76
Author: Josh Rosen <[email protected]>
Date: 2016-11-03T00:12:09Z
Change TaskInfo.accumulables into an immutable List.
commit 7e05630e9a78c455db8c8c499f0590c864624e05
Author: Josh Rosen <[email protected]>
Date: 2016-11-03T00:27:06Z
Intern hostname and executor id strings in blockManagerId and taskInfo JSON
protocol.
----