[ 
https://issues.apache.org/jira/browse/SPARK-13988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-13988:
------------------------------------

    Assignee: Apache Spark

> Large history files block new applications from showing up in History UI.
> -------------------------------------------------------------------------
>
>                 Key: SPARK-13988
>                 URL: https://issues.apache.org/jira/browse/SPARK-13988
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core
>    Affects Versions: 1.6.1
>            Reporter: Parth Brahmbhatt
>            Assignee: Apache Spark
>
> Some of our Spark users complain that their applications do not show up in
> the History Server UI. Our analysis suggests that this is a side effect of
> some applications' event logs being too big. This is especially true for
> Spark ML applications that may run many iterations, but it applies to other
> kinds of Spark jobs too. For example, on my local machine, just running the
> following generates an 80MB event log.
> {code}
> # Launch the shell with event logging enabled.
> ./spark-shell --master yarn --deploy-mode client \
>   --conf spark.eventLog.enabled=true \
>   --conf spark.eventLog.dir=hdfs://localhost:9000/tmp/spark-events
> // Then, at the Scala prompt:
> val words = sc.textFile("test.txt")
> for (i <- 1 to 10000) words.count()
> sc.stop()
> {code}
> For one of our users this file was as big as 12GB. He was running logistic
> regression using Spark ML. Because each application generates its own event
> log and event logs are processed serially in a single thread, one huge
> application can prevent many users from viewing their applications on the
> main UI. To overcome this issue I propose making the replay execution
> multi-threaded so that a single large event log won't block other
> applications from being rendered in the UI. This still cannot solve the
> issue completely if there are too many large event logs, but the
> alternatives I have considered (reading chunks from the beginning and end to
> get the application start and end events, or modifying the event log format
> so it has this info in a header or footer) are all more intrusive.
> In addition, there are several other things we can do to improve the History
> Server implementation.
> * During the log checker phase, to identify the application start and end
> times, the replaying thread processes the whole event log and throws away
> all the info apart from the application start and end events. This is a huge
> waste, given that as soon as a user clicks on the application we reprocess
> the same event log to get job/task details. We should either optimize the
> first level of parsing so it reads some chunks from the beginning and end to
> identify the application-level details or, better yet, cache the
> job/task-level details when we process the file for the first time.
> * On the job details page there is no pagination, and we only show the last
> 1000 job events when there are more than 1000. Granted, when users have more
> than 1K jobs they probably won't page through them, but not even having that
> option is a bad experience. Also, if that page is paginated we could
> probably get by with processing the event log only partially until the user
> wants to view the next page. This can help in cases where processing really
> large files causes OOM issues, as we will only be processing a subset of the
> file.
> * On startup, the History Server reprocesses the whole event log. For the
> top-level application details, we could persist the processing results from
> the last run in a more compact and searchable format to improve the
> bootstrap time. This is briefly mentioned in SPARK-6951.
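
The "read some chunks from the beginning and end" option in the first bullet could be sketched as below. This is a hedged illustration only: `HeadTailScanSketch` and `findStartAndEnd` are hypothetical names, and the code assumes an uncompressed, newline-delimited JSON event log in which the application start event falls entirely inside the first chunk and the application end event entirely inside the last chunk. A log that breaks those assumptions would still need a full replay.

```scala
import java.io.RandomAccessFile
import java.nio.charset.StandardCharsets

object HeadTailScanSketch {
  // Assumed markers for the two events of interest in a JSON-lines event log.
  private val StartMarker = "\"Event\":\"SparkListenerApplicationStart\""
  private val EndMarker = "\"Event\":\"SparkListenerApplicationEnd\""

  // Reads exactly `len` bytes starting at `offset` and decodes them as UTF-8.
  private def readChunk(file: RandomAccessFile, offset: Long, len: Int): String = {
    val buf = new Array[Byte](len)
    file.seek(offset)
    file.readFully(buf)
    new String(buf, StandardCharsets.UTF_8)
  }

  // Returns the raw start and end event lines, if found in the head and tail chunks.
  def findStartAndEnd(path: String, chunkBytes: Int): (Option[String], Option[String]) = {
    val file = new RandomAccessFile(path, "r")
    try {
      val size = file.length()
      val n = math.min(size, chunkBytes.toLong).toInt
      val truncated = n < size
      val headLines = readChunk(file, 0L, n).split("\n")
      val tailLines = readChunk(file, size - n, n).split("\n")
      // When the chunk is smaller than the file, the line at the cut may be
      // partial, so drop it (conservatively, even if it happened to be whole).
      val head = if (truncated) headLines.dropRight(1) else headLines
      val tail = if (truncated) tailLines.drop(1) else tailLines
      (head.find(_.contains(StartMarker)), tail.reverse.find(_.contains(EndMarker)))
    } finally {
      file.close()
    }
  }
}
```

A caller would fall back to replaying the whole file whenever either element of the returned pair is `None`, so this would only be a fast path in front of the existing behavior, not a replacement for it.
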



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
