GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/15743
[SPARK-18236] Reduce duplicate objects in Spark UI and HistoryServer
## What changes were proposed in this pull request?
When profiling heap dumps from the HistoryServer and live Spark web UIs, I
found a large amount of memory being wasted on duplicated objects and strings.
This patch removes most of that duplication, yielding over 40% memory savings
in some benchmarks.
- **Task metrics** (6441f0624dfcda9c7193a64bfb416a145b5aabdf): previously,
every `TaskUIData` object had its own instances of `InputMetricsUIData`,
`OutputMetricsUIData`, `ShuffleReadMetrics`, and `ShuffleWriteMetrics`, even
though for many tasks these metrics are irrelevant because they're all zero.
This patch changes how these metrics are constructed so that a single
immutable "empty" instance is re-used for all of the empty cases (see the
first sketch after this list).
- **TaskInfo.accumulables** (ade86db901127bf13c0e0bdc3f09c933a093bb76):
previously, every `TaskInfo` object had its own empty `ListBuffer` for holding
updates from named accumulators, so even tasks that never used named
accumulators paid the cost of allocating and storing that buffer. To avoid
this overhead, I changed the `val` holding a mutable buffer into a `var`
holding an immutable Scala list, which lets all tasks without named
accumulator updates share the same singleton `Nil` object (see the second
sketch after this list).
- **String.intern() in JSONProtocol**
(7e05630e9a78c455db8c8c499f0590c864624e05): in the HistoryServer, executor
hostnames and IDs are deserialized from JSON, leading to massive duplication
of these string objects. Calling `String.intern()` on the deserialized values
removes all of this duplication (see the third sketch after this list). Since
Spark now requires Java 7+, interned strings are stored on the main heap
rather than in permgen, so we don't have to worry about interning exhausting
permgen space (see http://java-performance.info/string-intern-in-java-6-7-8/).
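To make the first change concrete, here's a minimal, self-contained sketch of
the empty-singleton pattern; the class shape and the `of` factory are
simplified stand-ins rather than the exact Spark code:

```
// Simplified stand-in for one of the UI metrics classes (hypothetical shape).
case class InputMetricsUIData(bytesRead: Long, recordsRead: Long)

object InputMetricsUIData {
  // One shared immutable instance covering the common all-zero case.
  private val EMPTY = new InputMetricsUIData(0L, 0L)

  // Hypothetical factory: hands out the shared EMPTY value instead of
  // allocating a fresh all-zero object for every task.
  def of(bytesRead: Long, recordsRead: Long): InputMetricsUIData =
    if (bytesRead == 0L && recordsRead == 0L) EMPTY
    else InputMetricsUIData(bytesRead, recordsRead)
}

object EmptyMetricsDemo extends App {
  val a = InputMetricsUIData.of(0L, 0L)
  val b = InputMetricsUIData.of(0L, 0L)
  assert(a eq b) // all-zero metrics resolve to one shared object
}
```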
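The second change is the same idea applied to `TaskInfo`; here's a
before/after sketch with the element type simplified to `String` (the real
field holds accumulator-update info, and `recordAccumulable` is a hypothetical
helper):

```
import scala.collection.mutable.ListBuffer

// Before: every task allocated its own mutable buffer, even when it
// never recorded a named accumulator update.
class TaskInfoBefore {
  val accumulables = ListBuffer.empty[String]
}

// After: a var holding an immutable list. Tasks with no updates all
// share the one global Nil object instead of owning empty buffers.
class TaskInfoAfter {
  var accumulables: List[String] = Nil

  // Hypothetical helper: rebinds the var to a new immutable list.
  def recordAccumulable(update: String): Unit = {
    accumulables = update :: accumulables
  }
}
```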
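And a small demonstration of why `String.intern()` deduplicates the strings
parsed out of event logs; `new String(...)` stands in for JSON
deserialization, which likewise produces a distinct object for every equal
value:

```
object InternDemo extends App {
  // Simulate deserialization: new String always yields a fresh object,
  // just as each parsed event-log line does.
  val a = new String("host-1.example.com")
  val b = new String("host-1.example.com")

  assert(a == b)                   // equal contents...
  assert(!(a eq b))                // ...but two distinct heap objects
  assert(a.intern() eq b.intern()) // interning collapses them into one
}
```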
## How was this patch tested?
I ran
```
sc.parallelize(1 to 100000, 100000).count()
```
in `spark-shell` with event logging enabled, then loaded that event log in
the HistoryServer, performed a full GC, and took a heap dump. According to
YourKit, the changes in this patch reduced memory consumption by roughly 28
megabytes (about 770,000 fewer Java objects).

[Table: drop in per-object counts due to deduplication. The drop is under
100k for some objects because some events were dropped from the listener bus;
this is a separate, pre-existing bug that I'll address after CPU profiling.]
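
For anyone reproducing the measurement, event logging can be enabled from the
`spark-shell` command line; the log directory below is a placeholder:

```
$ ./bin/spark-shell \
    --conf spark.eventLog.enabled=true \
    --conf spark.eventLog.dir=/tmp/spark-events
```

The HistoryServer can then be pointed at the same directory (via
`spark.history.fs.logDirectory`) and started with
`./sbin/start-history-server.sh`.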

You can merge this pull request into a Git repository by running:

```
$ git pull https://github.com/JoshRosen/spark spark-ui-memory-usage
```
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15743.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15743
----
commit 6441f0624dfcda9c7193a64bfb416a145b5aabdf
Author: Josh Rosen <[email protected]>
Date: 2016-11-02T21:00:16Z
Re-use same instance for empty metrics UI data objects.
commit ade86db901127bf13c0e0bdc3f09c933a093bb76
Author: Josh Rosen <[email protected]>
Date: 2016-11-03T00:12:09Z
Change TaskInfo.accumulables into an immutable List.
commit 7e05630e9a78c455db8c8c499f0590c864624e05
Author: Josh Rosen <[email protected]>
Date: 2016-11-03T00:27:06Z
Intern hostname and executor id strings in blockManagerId and taskInfo JSON
protocol.
----