Github user steveloughran commented on the pull request:
https://github.com/apache/spark/pull/11118#issuecomment-183262779
Good Q. We thought it'd be simple at first too.
1. We need a notion of "out-of-dateness" which (a) supports different back
ends, and (b) works reliably for files stored in hdfs:// and other filesystems.
(It isn't handled against S3 or other object stores, but that's because they
only save their data on a `close()`, that is: at the end of a successful
application.) There's a rough sketch of such a probe after this list.
1. The Google cache class is, well, limited. Essentially what we are doing
is adding a probe to each cache entry which is triggered on retrieval, and
which can then cause a new web UI to be loaded (again, sketched after this
list).
1. The current probe comes from the FS provider. Initially the patch looked
at modification timestamps, but that proved unreliable (modtime granularity and
questions about when a new time actually becomes visible in the namenode).
Hence the move to file length.
1. The timeline provider, which I'm now working on elsewhere, does a GET of
the timeline server metadata for that instance and looks at an event count
pushed up there. That one is going to add a bit of a window on checks too
(somehow), to keep the load on the YARN timeline server down; the windowed
check is also sketched after this list.
1. We need to trigger an update check on GETs all the way down the UI. Given
how the servlet API works (it still expects to be configured by `web.xml`),
that's hard to do without singletons, hence the singleton at the bottom.
1. Finally, there are some metrics of what's going on. SPARK-11373 adds
metrics to the history server, of which this becomes a part.
1. Oh, and then there are the tests. They actually use the metrics as the
grey-box view into the cache, which ensures that the metrics actually get
written and that they'll remain stable over time. Break the metrics and the
tests fail, so you find out before the ops teams come after you.
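To make the first and third points concrete, here's a minimal sketch of what
the out-of-dateness probe could look like, with a file-length-based
implementation on top of the Hadoop `FileSystem` API. The names are
illustrative rather than the actual classes in the patch:

```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Each history back end supplies its own notion of "has this application's
// data changed since the cached UI was built?"
trait HistoryUpdateProbe {
  /** Return true if the cached UI for this attempt is out of date. */
  def isUpdated(): Boolean
}

// Filesystem-backed probe: compare the current event-log length against the
// length recorded when the UI was loaded. File length is used rather than
// modification time, whose granularity and visibility proved unreliable.
class FileLengthProbe(fs: FileSystem, logPath: Path, lengthAtLoad: Long)
    extends HistoryUpdateProbe {
  override def isUpdated(): Boolean =
    fs.getFileStatus(logPath).getLen > lengthAtLoad
}
```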
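The retrieval-time check against the Guava cache is then roughly a wrapper
like this (again made-up names, not the real code):

```scala
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

// A cache entry pairs the loaded value (the web UI) with the probe that can
// say whether the underlying history has grown since the load.
case class CacheEntry[V](value: V, probe: HistoryUpdateProbe)

// Wrap the Guava cache so every retrieval runs the probe first; a stale entry
// is invalidated and reloaded, which is where the new web UI gets built.
class RefreshingCache[K <: AnyRef, V](maxSize: Long, loader: K => CacheEntry[V]) {
  private val cache: LoadingCache[K, CacheEntry[V]] =
    CacheBuilder.newBuilder()
      .maximumSize(maxSize)
      .build(new CacheLoader[K, CacheEntry[V]] {
        override def load(key: K): CacheEntry[V] = loader(key)
      })

  def get(key: K): V = {
    val entry = cache.get(key)
    if (entry.probe.isUpdated()) {
      cache.invalidate(key)
      cache.get(key).value // the loader runs again and builds a fresh UI
    } else {
      entry.value
    }
  }
}
```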
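And the window on timeline-server checks could be as simple as a probe wrapper
that only consults its back end once per interval; purely a sketch under that
assumption:

```scala
// Rate-limit a probe so it hits its back end (e.g. a GET against the timeline
// server) at most once per window; between checks it returns the last answer.
class WindowedProbe(
    underlying: HistoryUpdateProbe,
    windowMillis: Long,
    now: () => Long = () => System.currentTimeMillis())
  extends HistoryUpdateProbe {

  private var lastCheck = 0L
  private var lastResult = false

  override def isUpdated(): Boolean = synchronized {
    val t = now()
    if (t - lastCheck >= windowMillis) {
      lastCheck = t
      lastResult = underlying.isUpdated()
    }
    lastResult
  }
}
```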
There are actually two other, bigger things which it would be possible to do
on this chain:
1. Incremental playback of changes. Rather than replaying an entire app's
history, start from where you left off (i.e. `file.length()+1`); there's a
sketch of the idea below this list. Maybe I'll look at that sometime, as it
would really benefit streaming work.
1. Something that works on object stores. There I'd go for Spark application
instances writing to HDFS, with a copy to S3 on completion, and the history
provider being able to (a) scan both dirs, and (b) do the copy if the app is
no longer running (i.e. it failed while declared incomplete). That's not on my
todo list.
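For the incremental playback idea, a rough sketch (illustrative names; assumes
the provider tracks how many bytes it has already applied, and a real version
would have to cut only on complete event lines):

```scala
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Replay only the bytes appended to the event log since the last refresh,
// instead of replaying the whole application history every time.
class IncrementalReplayer(fs: FileSystem, logPath: Path) {
  private var bytesReplayed = 0L

  /** Replay any newly appended events; returns the number of events applied. */
  def replayNewEvents(applyEvent: String => Unit): Int = {
    val len = fs.getFileStatus(logPath).getLen
    if (len <= bytesReplayed) {
      0
    } else {
      val in = fs.open(logPath)
      try {
        in.seek(bytesReplayed) // pick up where the last replay left off
        var count = 0
        for (line <- Source.fromInputStream(in, "UTF-8").getLines()) {
          applyEvent(line)
          count += 1
        }
        bytesReplayed = len
        count
      } finally {
        in.close()
      }
    }
  }
}
```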
Oh, and faster boot time with a summary file alongside the full history,
holding the main details (finished: Boolean, spark-version, ...) so that the
boot time goes from O(apps*events) to O(apps).
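The summary could be as small as a record like this (field names purely
illustrative), written next to the full event log when the app completes and
read instead of the full log during the startup scan:

```scala
// One small record per application; the history server's startup scan reads
// these instead of replaying every event, taking it from O(apps*events) to
// O(apps).
case class ApplicationSummary(
    appId: String,
    appName: String,
    sparkVersion: String,
    startTime: Long,
    endTime: Long,
    finished: Boolean)
```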