GitHub user JoshRosen opened a pull request:
https://github.com/apache/spark/pull/15756
[SPARK-18256] Improve the performance of event log replay in HistoryServer
## What changes were proposed in this pull request?
This patch significantly improves the performance of event log replay in
the HistoryServer via two simple changes:
- **Don't use `extractOpt`**: it turns out that `json4s`'s `extractOpt`
method uses exceptions for control flow, and the cost of initializing those
exceptions creates a huge performance bottleneck. To avoid this overhead, we
can simply use our own `Utils.jsonOption` method instead. This patch replaces
all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule
to ban the use of the slow `extractOpt` method.
- **Don't call `Utils.getFormattedClassName` for every event**: the old
code called `Utils.getFormattedClassName` dozens of times per replayed event in
order to match up class names in events with SparkListener event names. By
storing the results of these calls in constants rather than recomputing them,
we eliminate a huge performance hotspot by removing thousands of expensive
`Class.getSimpleName` calls.
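
As a rough illustration of both changes, here is a minimal, dependency-free sketch. It is not the code from this patch: `JValue`, `JNothing`, and `JString` are hypothetical stand-ins for the `json4s` types, `JobStart` and `JobEnd` stand in for SparkListener events, and `optionViaException` only mimics the exceptions-for-control-flow pattern described above rather than reproducing `json4s` internals. `formattedClassName` and `EventNames` are likewise names invented for this example.

```scala
// Hypothetical stand-ins for json4s's JValue types, so the sketch has no dependencies.
sealed trait JValue
case object JNothing extends JValue
final case class JString(value: String) extends JValue

// Hypothetical event classes standing in for SparkListener events.
final case class JobStart(jobId: Int)
final case class JobEnd(jobId: Int)

object ReplaySketch {
  // Exception-based style: a missing field is signalled by a throw and recovered
  // in a catch. Constructing the exception (and its stack trace) is the costly part.
  def optionViaException(json: JValue): Option[JValue] =
    try {
      if (json == JNothing) throw new NoSuchElementException("missing field")
      Some(json)
    } catch {
      case _: NoSuchElementException => None
    }

  // Pattern-match style: same result, but no exception on the missing-field path.
  def jsonOption(json: JValue): Option[JValue] = json match {
    case JNothing => None
    case value => Some(value)
  }

  // Assumed shape of a class-name formatter; Class.getSimpleName is the expensive call.
  def formattedClassName(cls: Class[_]): String = cls.getSimpleName.replace("$", "")

  // Computed once when the object is initialized, then reused for every event.
  object EventNames {
    val jobStart: String = formattedClassName(classOf[JobStart])
    val jobEnd: String = formattedClassName(classOf[JobEnd])
  }

  // Dispatch on the event name using the cached constants (stable identifier patterns),
  // so no class name is recomputed per event.
  def describe(eventName: String): String = eventName match {
    case EventNames.jobStart => "job started"
    case EventNames.jobEnd => "job ended"
    case _ => "unknown event"
  }

  def main(args: Array[String]): Unit = {
    println(jsonOption(JNothing))
    println(jsonOption(JString("x")))
    println(describe(EventNames.jobStart))
  }
}
```

Both versions of the option helper return the same values; the difference is only in what happens on the missing-field path, which is hit very often when replaying logs whose events omit optional fields.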
## How was this patch tested?
Tested by profiling the replay of a long event log using YourKit. For an
event log containing 1000+ jobs, each of which had thousands of tasks, the
changes in this patch cut the replay time in half.

Prior to this patch's changes, the two largest hotspots in log replay were
exceptions thrown internally by `json4s` and calls to `Class.getSimpleName()`.

After this patch, these hotspots are completely eliminated.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/JoshRosen/spark speed-up-jsonprotocol
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15756.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15756
----
commit 9861e1d641b1dee177172416663eb5953c719b59
Author: Josh Rosen <[email protected]>
Date: 2016-11-03T02:32:47Z
Use Utils.jsonOption instead of extractOpt.
commit aec09d57659c397f92f304af0ae8257bd33b4485
Author: Josh Rosen <[email protected]>
Date: 2016-11-03T02:43:11Z
Only get formatted class names once.
commit 2717f79b219295a9ba880e759fc454b8d0693859
Author: Josh Rosen <[email protected]>
Date: 2016-11-03T03:04:07Z
Use constants on write path as well.
----