GitHub user andrewor14 opened a pull request:
https://github.com/apache/spark/pull/204
[SPARK-1276] Add a HistoryServer to render persisted UI
Currently, a persisted UI can only be rendered through the standalone
Master. This greatly limits the usefulness of the new event logging feature,
which records the details of a Spark application as events, since many people
also run Spark on Yarn / Mesos.
This PR introduces a new entity called the HistoryServer, which, given a
log directory, keeps track of all completed applications independently of a
Spark Master. Unlike Master, the HistoryServer need not be running while the
application is still running. It is relatively lightweight in that it only
maintains static information about applications after the fact.
To quickly test it out, generate event logs with
```spark.eventLog.enabled=true``` and run
```sbin/start-history-server.sh <log-dir-path>```. Your HistoryServer awaits
on port 18080.
Other changes introduced in this PR include refactoring the WebUI
interface, which has accumulated a lot of duplicate code as we add
more functionality to it. Two new SparkListenerEvents have been introduced
(SparkListenerApplicationStart/End) to keep track of application name and
start/finish times. This PR also clarifies the semantics of the
ReplayListenerBus introduced in #42.
A potential TODO for the future (not part of this PR) is to render
applications that are still logging events, in addition to completed ones. This is
useful if an application fails, in which case our current HistoryServer does
not render the associated UI unless the user manually signals application
completion. Processing the event logs in this case becomes significantly more
complicated, however, because we must deal with multiple levels of streams that
may each have arbitrary behavior if we want to avoid processing the entire file
over and over again.
Comments and feedback are most welcome.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/andrewor14/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/204.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #204
----
commit c086bd5c6837a98d3c989c43f2b75aeaa0e5eff0
Author: Andrew Or <[email protected]>
Date: 2014-03-20T19:43:16Z
Add HistoryServer and scripts ++ Refactor WebUI interface
HistoryServer can be launched with ./sbin/start-history-server.sh <log-dir>
and stopped with ./sbin/stop-history-server.sh. This commit also involves
refactoring all the UIs to avoid duplicate code.
commit 8aac16355329809b11c76430fa8737d328f2e962
Author: Andrew Or <[email protected]>
Date: 2014-03-20T21:34:34Z
Add basic application table
commit 758441890dc86c8ed069e6c684b21528038f2ff7
Author: Andrew Or <[email protected]>
Date: 2014-03-21T04:59:34Z
Report application start/end times to HistoryServer
This involves adding application start and end events. This also
allows us to record the actual app name instead of simply using
the name of the directory.
commit 60bc6d57577742e861d62c183ec56d9893e3ea6a
Author: Andrew Or <[email protected]>
Date: 2014-03-22T01:17:43Z
First complete implementation of HistoryServer (only for finished apps)
This involves a change in Spark's event log format. All event logs are
now prefixed with EVENT_LOG_. If compression is used, the logger creates
a special empty file prefixed with COMPRESSION_CODEC_ that indicates which
codec is used. After the application finishes, the logger logs a special
empty file named APPLICATION_COMPLETE.
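As a rough illustration of the layout described above, here is a minimal Python
sketch that inspects one application's log directory. The helper name
`describe_log_dir` and the returned dictionary are hypothetical and not part of
this PR. Only the EVENT_LOG_ and COMPRESSION_CODEC_ prefixes and the
APPLICATION_COMPLETE marker come from the commit message. The sketch also
assumes the codec is encoded in the marker file's name after the prefix, which
the message implies (the file is empty) but does not state.

```python
import os

# Names taken from the commit message above.
EVENT_LOG_PREFIX = "EVENT_LOG_"
COMPRESSION_CODEC_PREFIX = "COMPRESSION_CODEC_"
APPLICATION_COMPLETE = "APPLICATION_COMPLETE"


def describe_log_dir(path):
    """Summarize one application's log directory (hypothetical helper)."""
    names = sorted(os.listdir(path))
    event_logs = [n for n in names if n.startswith(EVENT_LOG_PREFIX)]
    # Assumption: the codec is whatever follows the prefix in the file name.
    codecs = [
        n[len(COMPRESSION_CODEC_PREFIX):]
        for n in names
        if n.startswith(COMPRESSION_CODEC_PREFIX)
    ]
    return {
        "event_logs": event_logs,
        "codec": codecs[0] if codecs else None,
        # The empty marker file signals that the application has finished.
        "complete": APPLICATION_COMPLETE in names,
    }
```

A directory without the APPLICATION_COMPLETE marker would be reported as
incomplete, which matches the "only for finished apps" scope of this commit.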
The ReplayListenerBus is now responsible for parsing all of the above
file formats. In this commit, we establish a one-to-one mapping between
ReplayListenerBus and event logging applications. The semantics of the
ReplayListenerBus are further clarified (e.g. replay is not allowed
before starting, and can only be called once).
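The "start first, replay once" rule can be pictured with a toy model in Python.
This is not Spark's ReplayListenerBus: the class name `ReplayBusSketch`, its
methods, and the choice to raise an exception on misuse are all assumptions
made for illustration.

```python
class ReplayBusSketch:
    """Toy model of the replay rules; not Spark's actual class."""

    def __init__(self, events):
        self._events = list(events)
        self._listeners = []
        self._started = False
        self._replayed = False

    def add_listener(self, listener):
        # A listener here is any callable that accepts one event.
        self._listeners.append(listener)

    def start(self):
        self._started = True

    def replay(self):
        # Rule 1: replay is not allowed before starting.
        if not self._started:
            raise RuntimeError("replay() called before start()")
        # Rule 2: replay can only be called once.
        if self._replayed:
            raise RuntimeError("replay() may only be called once")
        self._replayed = True
        for event in self._events:
            for listener in self._listeners:
                listener(event)
```

Whether the real implementation raises, logs, or silently ignores a second
call is not specified in this message, so the error handling above is only one
possible choice.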
This commit also adds a control mechanism for the frequency at which
HistoryServer accesses the disk to check for log updates. This enforces
a minimum interval of N seconds between two checks, where N is arbitrarily
chosen to be 5.
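The minimum interval between disk checks can be sketched as a simple throttle.
The Python below is an illustration under assumptions, not the code in this
commit: `ThrottledChecker`, `maybe_check`, and the injectable `clock` are
hypothetical names, and only the 5 second default comes from the message.

```python
import time


class ThrottledChecker:
    """Run a check at most once per min_interval_seconds (hypothetical)."""

    def __init__(self, check, min_interval_seconds=5.0, clock=time.monotonic):
        self._check = check
        self._min_interval = min_interval_seconds
        self._clock = clock  # injectable so the logic can be tested
        self._last_check = None

    def maybe_check(self):
        """Run the check if enough time has passed; return True if it ran."""
        now = self._clock()
        too_soon = (
            self._last_check is not None
            and now - self._last_check < self._min_interval
        )
        if too_soon:
            return False
        self._last_check = now
        self._check()
        return True
```

With this shape, callers can request a check as often as they like, and the
disk is still touched at most once per interval.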
commit 5dbfbb47826ea2edbf8cf2100228bddb5be473f8
Author: Andrew Or <[email protected]>
Date: 2014-03-22T01:54:28Z
Merge branch 'master' of github.com:apache/spark
Conflicts:
core/src/main/scala/org/apache/spark/deploy/DeployWebUI.scala
core/src/main/scala/org/apache/spark/deploy/WebUI.scala
core/src/main/scala/org/apache/spark/deploy/master/Master.scala
core/src/main/scala/org/apache/spark/ui/WebUI.scala
----