GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/15756

    [SPARK-18256] Improve the performance of event log replay in HistoryServer

    ## What changes were proposed in this pull request?
    
    This patch significantly improves the performance of event log replay in 
the HistoryServer via two simple changes:
    
    - **Don't use `extractOpt`**: it turns out that `json4s`'s `extractOpt` 
method uses exceptions for control flow, and the cost of initializing those 
exceptions becomes a huge performance bottleneck. To avoid this overhead, we 
can simply use our own `Utils.jsonOption` method instead. This patch replaces 
all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule 
to ban the use of the slow `extractOpt` method.
    - **Don't call `Utils.getFormattedClassName` for every event**: the old 
code called `Utils.getFormattedClassName` dozens of times per replayed event in 
order to match up class names in events with SparkListener event names. By 
simply storing the results of these calls in constants rather than recomputing 
them, we're able to eliminate a huge performance hotspot by removing thousands 
of expensive `Class.getSimpleName` calls.
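
To illustrate the first change, here is a minimal, self-contained sketch of the two lookup styles. The `JValue`, `JNothing`, `JString` and `JObject` types below are simplified stand-ins for the `json4s` AST, and `viaException` / `viaMatch` are hypothetical helper names; none of this is the actual `json4s` or Spark code. It only shows why pattern matching on a "missing" sentinel avoids creating an exception for every absent field.

```scala
// Simplified stand-ins for the json4s AST (assumption: illustration only).
sealed trait JValue
case object JNothing extends JValue
final case class JString(s: String) extends JValue
final case class JObject(fields: Map[String, JValue]) extends JValue {
  // Field lookup that returns JNothing when the field is absent.
  def \(name: String): JValue = fields.getOrElse(name, JNothing)
}

object JsonOptionSketch {
  // Exception-driven style: every missing field constructs and throws an
  // exception, which is then caught and turned into None.
  def viaException(json: JObject, name: String): Option[String] =
    try {
      (json \ name) match {
        case JString(s) => Some(s)
        case _ => throw new IllegalArgumentException(s"cannot extract $name")
      }
    } catch {
      case _: IllegalArgumentException => None
    }

  // Pattern-match style: a missing field is an ordinary value (JNothing),
  // so no exception is ever created.
  def jsonOption(json: JValue): Option[JValue] = json match {
    case JNothing => None
    case value => Some(value)
  }

  def viaMatch(json: JObject, name: String): Option[String] =
    jsonOption(json \ name).collect { case JString(s) => s }
}
```

Both helpers return the same result for present and absent fields; the difference is only in how much work the absent case costs.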
    
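
The second change can be sketched in the same spirit. The event classes, the `formattedClassName` helper and the `EventNames` object below are hypothetical names chosen for illustration, not the identifiers used in the patch. The point is that the names are computed once, when `EventNames` is initialized, and the per-event dispatch only compares strings.

```scala
// Hypothetical event types standing in for SparkListener events.
final case class StageSubmitted(stageId: Int)
final case class TaskEnd(taskId: Long)

object ReplaySketch {
  // Stand-in for a helper that derives an event name from a class
  // (assumption: the simple class name with any '$' characters removed).
  def formattedClassName(cls: Class[_]): String =
    cls.getSimpleName.replace("$", "")

  // Evaluated once, on first access, instead of once per replayed event.
  object EventNames {
    val stageSubmitted: String = formattedClassName(classOf[StageSubmitted])
    val taskEnd: String = formattedClassName(classOf[TaskEnd])
  }

  // Per-event dispatch: matches against the cached constants and makes
  // no getSimpleName calls.
  def dispatch(eventName: String): String = {
    import EventNames._
    eventName match {
      case `stageSubmitted` => "stage submitted"
      case `taskEnd` => "task end"
      case _ => "unknown"
    }
  }
}
```

The backticks in the `match` make Scala compare against the existing `val`s rather than bind new variables, which is what allows cached names to be used as patterns.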
    ## How was this patch tested?
    
    Tested by profiling the replay of a long event log using YourKit. For an 
event log containing 1000+ jobs, each of which had thousands of tasks, the 
changes in this patch cut the replay time in half:
    
    
![image](https://cloud.githubusercontent.com/assets/50748/19980953/31154622-a1bd-11e6-9be4-21fbb9b3f9a7.png)
    
    Prior to this patch's changes, the two largest hotspots in log replay were 
internal exceptions thrown by `json4s` and calls to `Class.getSimpleName()`:
    
    
![image](https://cloud.githubusercontent.com/assets/50748/19981052/87416cce-a1bd-11e6-9f25-06a7cd391822.png)
    
    After this patch, these hotspots are completely eliminated.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark speed-up-jsonprotocol

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15756.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #15756
    
----
commit 9861e1d641b1dee177172416663eb5953c719b59
Author: Josh Rosen <[email protected]>
Date:   2016-11-03T02:32:47Z

    Use Utils.jsonOption instead of extractOpt.

commit aec09d57659c397f92f304af0ae8257bd33b4485
Author: Josh Rosen <[email protected]>
Date:   2016-11-03T02:43:11Z

    Only get formatted class names once.

commit 2717f79b219295a9ba880e759fc454b8d0693859
Author: Josh Rosen <[email protected]>
Date:   2016-11-03T03:04:07Z

    Use constants on write path as well.

----


