[
https://issues.apache.org/jira/browse/SPARK-13988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-13988:
------------------------------------
Assignee: Apache Spark
> Large history files block new applications from showing up in History UI.
> -------------------------------------------------------------------------
>
> Key: SPARK-13988
> URL: https://issues.apache.org/jira/browse/SPARK-13988
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core
> Affects Versions: 1.6.1
> Reporter: Parth Brahmbhatt
> Assignee: Apache Spark
>
> Some of our Spark users complain that their applications do not show up in the
> history server UI. Our analysis suggests that this is a side effect of some
> application's event log being too big. This is especially true for Spark ML
> applications, which may run a lot of iterations, but it applies to other kinds
> of Spark jobs too. For example, on my local machine just running the following
> generates an event log of about 80MB.
> {code}
> ./spark-shell --master yarn --deploy-mode client \
>   --conf spark.eventLog.enabled=true \
>   --conf spark.eventLog.dir=hdfs://localhost:9000/tmp/spark-events
>
> val words = sc.textFile("test.txt")
> for (i <- 1 to 10000) words.count()
> sc.stop()
> {code}
> For one of our users this file was as big as 12GB; he was running logistic
> regression using Spark ML. Given that each application generates its own
> event log and event logs are processed serially in a single thread, one huge
> application can prevent a lot of users from viewing their applications on the
> main UI. To overcome this issue I propose to make the replay execution
> multi-threaded so that a single large event log won't block other applications
> from being rendered in the UI; a minimal sketch of the idea follows below.
> This still cannot solve the issue completely if there are too many large
> event logs, but the alternatives I have considered (reading chunks from the
> beginning and end to get the application start and end events, or modifying
> the event log format so it carries this info in a header or footer) are all
> more intrusive.
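> A minimal sketch of the multi-threaded replay idea, assuming a bounded thread
> pool and a hypothetical replayLog helper (this is not the actual
> FsHistoryProvider code):
> {code}
> import java.util.concurrent.{Executors, TimeUnit}
>
> // Sketch only: replayLog stands in for whatever parses a single event log
> // and registers the application with the UI.
> object ReplaySketch {
>   // Bounded pool: one oversized log ties up a single worker instead of
>   // blocking every other application from showing up.
>   private val pool = Executors.newFixedThreadPool(4)
>
>   def replayLog(logPath: String): Unit = {
>     // ... parse the event log, extract application start/end events ...
>   }
>
>   def checkForLogs(logPaths: Seq[String]): Unit = {
>     logPaths.foreach { path =>
>       pool.submit(new Runnable { def run(): Unit = replayLog(path) })
>     }
>   }
>
>   def shutdown(): Unit = {
>     pool.shutdown()
>     pool.awaitTermination(1, TimeUnit.MINUTES)
>   }
> }
> {code}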
> In addition, there are several other things we can do to improve the History
> Server implementation:
> * During the log checking phase, to identify the application start and end
> times, the replaying thread processes the whole event log and throws away
> everything except the application start and end events. This is a huge waste,
> given that as soon as a user clicks on the application we reprocess the same
> event log to get the job/task details. We should either optimize the first
> level of parsing so it reads chunks from the beginning and end to identify
> the application-level details, or better yet cache the job/task-level details
> when we process the file for the first time. A rough sketch of the
> chunk-reading option follows this list.
> * On the job details page there is no pagination, and we only show the last
> 1000 job events when there are more than 1000. Granted, when users have more
> than 1K jobs they probably won't page through them, but not even having that
> option is a bad experience. Also, if that page were paginated we could get
> away with only partially processing the event log until the user wants to
> view the next page. This can help in cases where processing really large
> files causes OOM issues, as we would only be processing a subset of the file
> (see the pagination sketch after this list).
> * On startup, the history server reprocesses every event log. For the
> top-level application details, we could persist the processing results from
> the last run in a more compact and searchable format to improve the bootstrap
> time. This is briefly mentioned in SPARK-6951; a sketch of such a summary
> index also follows this list.
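> For the first bullet, a rough sketch of the "read chunks from both ends" idea
> over a local file; a real implementation would go through Hadoop's FileSystem
> API, and the 64KB window is an arbitrary assumption, while the
> SparkListenerApplicationStart/End event names are the real ones:
> {code}
> import java.io.RandomAccessFile
> import scala.io.Source
>
> // Sketch only: scan a small window from each end of an event log (one JSON
> // event per line) instead of replaying the whole file.
> object EndpointScanSketch {
>   private val windowBytes = 64 * 1024
>
>   def scan(path: String): (Option[String], Option[String]) = {
>     // Head: the ApplicationStart event is one of the first few lines.
>     val head = Source.fromFile(path)
>     val start =
>       try head.getLines().take(100).find(_.contains("SparkListenerApplicationStart"))
>       finally head.close()
>
>     // Tail: read the last ~64KB and look for the ApplicationEnd event.
>     val raf = new RandomAccessFile(path, "r")
>     val end =
>       try {
>         val len = raf.length()
>         raf.seek(math.max(0L, len - windowBytes))
>         val buf = new Array[Byte](math.min(windowBytes.toLong, len).toInt)
>         raf.readFully(buf)
>         new String(buf, "UTF-8").split("\n").reverseIterator
>           .find(_.contains("SparkListenerApplicationEnd"))
>       } finally raf.close()
>
>     (start, end)
>   }
> }
> {code}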
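> For the pagination bullet, a sketch of page-wise processing with a
> hypothetical jobPage helper; earlier pages are skipped lazily and only the
> requested page of SparkListenerJobStart events is materialized:
> {code}
> import scala.io.Source
>
> // Sketch only: stream the log lazily and stop once the requested page of job
> // events has been filled, instead of keeping the last 1000 jobs in memory.
> object JobPageSketch {
>   def jobPage(logPath: String, page: Int, pageSize: Int = 100): Seq[String] = {
>     val source = Source.fromFile(logPath)
>     try {
>       source.getLines()
>         .filter(_.contains("SparkListenerJobStart"))
>         .slice(page * pageSize, (page + 1) * pageSize) // lazy skip + take
>         .toList                                        // only this page is held
>     } finally source.close()
>   }
> }
> {code}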
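> For the last bullet, a sketch of persisting per-application summaries between
> restarts; the tab-separated index format and the AppSummary fields are
> assumptions, in the spirit of SPARK-6951:
> {code}
> import java.io.{File, FileWriter, PrintWriter}
> import scala.io.Source
>
> // Sketch only: after an event log has been parsed once, append a compact
> // summary line; on startup, load the index instead of replaying every log.
> case class AppSummary(id: String, name: String, startMs: Long, endMs: Long)
>
> object SummaryIndexSketch {
>   def append(index: File, s: AppSummary): Unit = {
>     val out = new PrintWriter(new FileWriter(index, true)) // append mode
>     try out.println(s"${s.id}\t${s.name}\t${s.startMs}\t${s.endMs}")
>     finally out.close()
>   }
>
>   def load(index: File): Seq[AppSummary] = {
>     if (!index.exists()) return Nil
>     val source = Source.fromFile(index)
>     try {
>       source.getLines().map { line =>
>         val Array(id, name, start, end) = line.split("\t", 4)
>         AppSummary(id, name, start.toLong, end.toLong)
>       }.toList
>     } finally source.close()
>   }
> }
> {code}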