Vinayak Joshi created SPARK-18010:
-------------------------------------
Summary: Remove unneeded heavy work performed by FsHistoryProvider
for building up the application listing UI page
Key: SPARK-18010
URL: https://issues.apache.org/jira/browse/SPARK-18010
Project: Spark
Issue Type: Improvement
Components: Spark Core, Web UI
Affects Versions: 2.0.1, 1.6.2, 2.1.0
Reporter: Vinayak Joshi
There are known complaints about the History Server's application list not
updating quickly enough when the event log files that need replay are huge.
Currently, the FsHistoryProvider design causes the entire event log file to be
replayed when building the initial application listing (see the method
mergeApplicationListing(fileStatus: FileStatus)). The process of replay
involves:
- reading each line in the event log as a string,
- parsing the string into a JSON structure,
- converting the JSON into the corresponding Scala classes with nested structures.
Parsing the string into JSON and then converting it into Scala classes is
particularly expensive. Tests show that the majority of the time spent in
replay goes to this work.
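
As a rough illustration of the replay steps described above (not the actual
FsHistoryProvider code, which is Scala), the following Python sketch pays the
parse and object-construction cost on every line. The Event class, the
replay_all function, and the sample log lines are hypothetical; the log is
only assumed to hold one JSON object per line with an "Event" field.

```python
import json
from dataclasses import dataclass


@dataclass
class Event:
    """Hypothetical stand-in for the typed event classes built during replay."""
    name: str
    payload: dict


def replay_all(lines):
    """Full replay: every non-empty line is parsed and converted."""
    events = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)                         # string -> JSON structure
        events.append(Event(record["Event"], record))     # JSON -> typed object
    return events


# Assumed sample data, for illustration only.
sample_log = [
    '{"Event":"SparkListenerApplicationStart","App Name":"demo","Timestamp":1}',
    '{"Event":"SparkListenerTaskEnd","Task Info":{"Task ID":0}}',
    '{"Event":"SparkListenerApplicationEnd","Timestamp":2}',
]
events = replay_all(sample_log)
```

In this sketch the cost grows with the total number of lines, regardless of
how many of the resulting events a listener actually uses.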
When the replay is performed for building the application listing, the only two
events that the code really cares about are "SparkListenerApplicationStart" and
"SparkListenerApplicationEnd", since the only listener attached to the
ReplayListenerBus at that point is the ApplicationEventListener. This means
that when processing an event log file with a huge number of events (hundreds
of thousands, possibly more), the work done to deserialize all of these events
and then replay them is not needed. Since only these two events matter, replay
for the purpose of building the application list can be restricted to them,
skipping all others.
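
One possible way to sketch this idea, in Python for illustration rather than
as the proposed Scala change: check each raw line for the two event names with
a cheap substring test, and parse only the lines that pass. The names WANTED
and replay_selected and the sample lines are hypothetical, and the log is
assumed to hold one JSON object per line with an "Event" field.

```python
import json

# The two events the application listing needs, per the description above.
WANTED = ("SparkListenerApplicationStart", "SparkListenerApplicationEnd")


def replay_selected(lines, wanted=WANTED):
    """Parse only lines that might hold a wanted event; skip the rest unparsed."""
    selected = []
    for line in lines:
        # Cheap pre-filter on the raw string, no JSON parsing yet.
        if not any(name in line for name in wanted):
            continue
        record = json.loads(line)
        # Confirm after parsing, since the name could appear inside another
        # event's payload and produce a false positive in the pre-filter.
        if record.get("Event") in wanted:
            selected.append(record)
    return selected


# Assumed sample data, for illustration only.
sample_log = [
    '{"Event":"SparkListenerApplicationStart","App Name":"demo","Timestamp":1}',
    '{"Event":"SparkListenerTaskEnd","Task Info":{"Task ID":0}}',
    '{"Event":"SparkListenerLogStart","Note":"SparkListenerApplicationStart"}',
    '{"Event":"SparkListenerApplicationEnd","Timestamp":2}',
]
selected = replay_selected(sample_log)
```

The substring test can let through a line that merely mentions one of the
names, so the sketch re-checks the "Event" field after parsing; lines that
fail the pre-filter are never deserialized at all.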
My tests show that this drastically improves application list load time. For a
150MB event log from a user, with over 100,000 events, the load time (running
locally on my Mac) comes down from about 16 seconds to under 1 second using
this approach.
For customers that typically run applications with large event logs, and thus
have multiple large event logs present, this can considerably reduce the time
before the history server UI lists the applications.
I will be posting a pull request with my take on fixing this.