[ 
https://issues.apache.org/jira/browse/SPARK-18010?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-18010:
------------------------------------

    Assignee: Apache Spark

> Remove unneeded heavy work performed by FsHistoryProvider for building up the 
> application listing UI page
> ---------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-18010
>                 URL: https://issues.apache.org/jira/browse/SPARK-18010
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, Web UI
>    Affects Versions: 1.6.2, 2.0.1, 2.1.0
>            Reporter: Vinayak Joshi
>            Assignee: Apache Spark
>
> There are known complaints about the History Server's application list not 
> updating quickly enough when the event log files that need replay are huge. 
> Currently, the FsHistoryProvider design causes the entire event log file to 
> be replayed when building the initial application listing (see the method 
> mergeApplicationListing(fileStatus: FileStatus)). The process of replay 
> involves:
>  - reading each line in the event log as a string,
>  - parsing the string into a JSON structure,
>  - converting the JSON to the corresponding Scala classes with nested 
> structures.
> The parsing of strings to JSON and then to Scala classes is particularly 
> expensive. Tests show that the majority of the time spent in replay goes to 
> this work. 
> When the replay is performed for building the application listing, the only 
> two events that the code really cares about are 
> "SparkListenerApplicationStart" and "SparkListenerApplicationEnd", since the 
> only listener attached to the ReplayListenerBus at that point is the 
> ApplicationEventListener. This means that when processing an event log file 
> with a huge number of events (hundreds of thousands, or more), the work done 
> to deserialize all of these events and then replay them is not needed. Since 
> only these two events matter when building the application list, the replay 
> performed for that purpose can handle just these two events and skip the 
> others. 
> My tests show that this drastically improves application list load time. For 
> a 150MB event log from a user, with over 100,000 events, the load time (local, 
> on my Mac) comes down from about 16 seconds to under 1 second using this 
> approach. For customers that typically execute applications with large event 
> logs, and thus have multiple large event logs present, this can considerably 
> speed up how soon the history server UI lists the apps.
> I will be updating a pull request with a take at fixing this.
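
The idea described above (skip deserialization for every line except the two application events) might be sketched roughly as follows. This is an illustrative Python sketch, not the FsHistoryProvider code and not the contents of the pull request: the function name listing_events, the sample log lines, and the assumption that each log line is a JSON object whose "Event" field holds the event name are all hypothetical choices made for the example.

```python
import json

# Event names taken from the issue description above.
LISTING_EVENTS = ("SparkListenerApplicationStart", "SparkListenerApplicationEnd")


def listing_events(lines):
    """Yield the parsed events needed for the listing, skipping all others.

    Hypothetical helper: assumes one JSON object per line with an "Event" field.
    """
    for line in lines:
        # Cheap substring test first, so most lines never reach the JSON parser.
        if not any(name in line for name in LISTING_EVENTS):
            continue
        event = json.loads(line)
        # Confirm on the parsed field, since the name could also appear
        # inside the payload of some other event.
        if event.get("Event") in LISTING_EVENTS:
            yield event


# Made-up sample lines, for illustration only.
sample_log = [
    '{"Event": "SparkListenerApplicationStart", "App Name": "demo", "Timestamp": 1}',
    '{"Event": "SparkListenerTaskEnd", "Task Info": {"Task ID": 7}}',
    '{"Event": "SparkListenerApplicationEnd", "Timestamp": 2}',
]
kept = list(listing_events(sample_log))
print([e["Event"] for e in kept])
```

The substring check is only a prefilter; the parsed "Event" field makes the final decision, so a line that merely mentions one of the names is parsed but not returned.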



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
