Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/11118#issuecomment-183262779
  
    Good Q. We thought it'd be simple at first too.
    
    1. We need a notion of "out-of-dateness" which (a) supports different back 
ends, and (b) works reliably for files stored in hdfs:// and other filesystems. 
(It isn't handled against S3 or other object stores, but that's because they 
only save their data on a `close()`, that is: at the end of a successful 
application.)
    1. The Google Guava cache class is, well, limited. Essentially what we are 
doing is adding a probe to the cache entries which is triggered on retrieval, 
and which can then cause a new web UI to be loaded.
    1. The current probe comes from the FS provider. Initially the patch looked 
at modification timestamps, but that proved unreliable (modtime granularity, 
and questions about when a change actually becomes visible in the namenode). 
Hence the move to file length.
    1. The timeline provider, which I'm now working on elsewhere, does a GET of 
the timeline server metadata for that instance and looks at the event count 
pushed up there. That one is also going to add a bit of a window on checks 
(somehow), to keep the load on the YARN timeline server down.
    1. We need to trigger an update check on GETs all the way down the UI. 
Given how the servlet API works (it still expects to be configured by 
`web.xml`), that's hard to do without singletons; hence the singleton at the 
bottom.
    1. Finally, there are some metrics of what's going on. SPARK-11373 adds 
metrics to the history server, of which this becomes a part.
    1. Oh, and then there are the tests. They actually use the metrics as the 
grey-box view into the cache, ensuring that the metrics actually get written 
and that they'll remain stable over time. Break the metrics and the tests 
fail, so you find out before the ops teams come after you.
    
    There are actually two other, bigger things which would be possible to do 
on this chain:
    
    1. Incremental playback of changes. Rather than replay an entire app's 
history, start from where you left off (i.e. `file.length()+1`). Maybe I'll 
look at that sometime, as it would really benefit streaming work.
    1. Something that works on object stores. There I'd go for Spark 
application instances writing to HDFS, with a copy to S3 on completion, and 
the history provider being able to (a) scan both dirs, and (b) do the copy if 
the app is no longer running (i.e. it failed while declared incomplete). 
That's not on my todo list.
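    Incremental playback would amount to remembering the byte offset reached on 
the previous pass and only parsing what was appended since. A rough sketch, 
with illustrative names only (`IncrementalReplay`, `readNewEvents` are not 
real Spark classes):

    ```java
    import java.io.File;
    import java.io.IOException;
    import java.io.RandomAccessFile;
    import java.util.ArrayList;
    import java.util.List;

    // Hypothetical sketch of incremental replay: seek to the offset reached
    // last time and parse only the newly appended event lines.
    class IncrementalReplay {
        private long offset = 0;  // bytes already processed

        List<String> readNewEvents(File log) throws IOException {
            List<String> events = new ArrayList<>();
            try (RandomAccessFile raf = new RandomAccessFile(log, "r")) {
                if (raf.length() <= offset) {
                    return events;  // nothing new since last pass
                }
                raf.seek(offset);
                String line;
                while ((line = raf.readLine()) != null) {
                    events.add(line);
                }
                offset = raf.getFilePointer();  // resume here next time
            }
            return events;
        }
    }
    ```

    This keeps each refresh proportional to the new events rather than the 
whole history, which is what makes it attractive for long-running streaming 
apps.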
    
    Oh, and faster boot time with a summary file alongside the full history, 
holding the main details (finished: Boolean, spark-version, ...), so that boot 
time goes from O(apps*events) to O(apps).
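    That summary-file idea could look something like the following; the field 
names and file layout here are purely illustrative assumptions, not a 
proposed format:

    ```java
    import java.io.File;
    import java.io.FileInputStream;
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.io.OutputStream;
    import java.util.Properties;

    // Hypothetical sketch: write a small sidecar summary next to the full
    // event log, so listing applications at startup never replays events.
    class AppSummary {
        static void write(File summaryFile, boolean finished, String sparkVersion)
                throws IOException {
            Properties p = new Properties();
            p.setProperty("finished", Boolean.toString(finished));
            p.setProperty("spark.version", sparkVersion);
            try (OutputStream out = new FileOutputStream(summaryFile)) {
                p.store(out, "history summary");
            }
        }

        static Properties read(File summaryFile) throws IOException {
            Properties p = new Properties();
            try (InputStream in = new FileInputStream(summaryFile)) {
                p.load(in);
            }
            return p;
        }
    }
    ```

    Booting then means one small read per app instead of replaying every 
event, which is where the O(apps*events) to O(apps) improvement comes from.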

