Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/6935#issuecomment-161950768
  
    bq. I agree with the complexity; the initial "modtime" pass just didn't 
work: with a granularity of a couple of seconds on some filesystems, changes 
were sometimes not being detected at all, especially with two jobs back-to-back. 
Tracking file sizes did work, but it still had problems. I spent a lot of time 
looking at debug-level logs, comparing timestamps on log messages with those of 
cache & provider entries, as well as the FS values. Run the test at debug level 
and you'll see the details and diagnostics I had to put in.
    
    One particular deal-breaker was that *when* metadata updates became visible 
was potentially trouble. I could sketch out a sequence of operations where the 
writer updated a file (which got a new modtime & size), but that change didn't 
immediately become visible. The reader reads the file & gets the old time, 
while recording the time of the read (> modtime). Later polls would not pick up 
the changes, as the now-updated information was appearing with older 
timestamps. A generation counter of file-size changes took away all the 
ambiguity.
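
    As a rough sketch of that generation-counter idea (all the names below are 
made up for illustration; the patch's actual data structures may differ):

    ```scala
    // Illustrative sketch only: bump a monotonically increasing generation
    // counter whenever the observed file size changes, instead of trusting
    // filesystem modtimes with coarse or lagging granularity.
    object GenerationTracking {
      case class LogInfo(path: String, size: Long, generation: Long)

      def onScan(previous: LogInfo, observedSize: Long): LogInfo = {
        if (observedSize != previous.size) {
          // Size changed: bump the generation so callers see "something changed",
          // whatever the (possibly stale) modtime says.
          previous.copy(size = observedSize, generation = previous.generation + 1)
        } else {
          previous
        }
      }

      // A cached UI needs a refresh only when it was built from an older
      // generation than the provider currently holds for that app.
      def needsRefresh(cachedGeneration: Long, currentGeneration: Long): Boolean =
        cachedGeneration < currentGeneration
    }
    ```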
    
    W.r.t. the continuous streaming of data, I didn't try that here; I just 
went for eviction. I'm not confident that it would scale to the really big 
clusters with many Spark jobs running simultaneously, long-lived streaming 
apps, short-lived analytics jobs running behind Hive queries, etc, etc.
    
    This cache eviction should scale because:
    
    1. It only looks at jobs that are being actively viewed by users.
    2. The cost of the poll/refresh operation itself is very low: a directory 
scan.
    3. You can set a window for how long an incomplete app can live in the FS 
before it is even polled for changes; on a big cluster, you could set it to a 
couple of hours (see the sketch below).
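
    One possible reading of that windowing idea, purely as an illustration (the 
class and value names are invented here, not the actual patch or any real Spark 
config):

    ```scala
    // Illustrative sketch only: an incomplete app's logs are not re-polled
    // until at least `windowMs` has passed since the last check.
    class PollWindow(windowMs: Long) {
      /** True if enough time has passed that a poll is worth its (small) cost. */
      def shouldPoll(lastCheckedMs: Long, nowMs: Long): Boolean =
        (nowMs - lastCheckedMs) >= windowMs
    }

    // On a big cluster you might widen this to a couple of hours:
    // new PollWindow(2L * 60 * 60 * 1000)
    ```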
    
    Actually, on the very big clusters we may want to think about disabling 
eviction entirely. That won't pick up the transition of an incomplete app to a 
complete one, though: we may need a way for a provider to detect, in a 
provider-specific manner, whether or not an attempt/app has completed, even 
without the poll.
    
    Now that `EventLogListener.stop()` explicitly sets the modtime on files, 
this should happen on all filesystems. I should just order the test sequence so 
that the listing checks take place *before any GET calls on the attempt UI*.
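
    For reference, a minimal sketch of what "explicitly set the modtime" can 
look like against the Hadoop `FileSystem` API (the object and method names here 
are mine, not the patch's):

    ```scala
    // Illustrative sketch only: force the log file's modification time forward
    // at stop() time so pollers see the update regardless of how coarse the
    // filesystem's own modtime granularity is.
    import org.apache.hadoop.fs.{FileSystem, Path}

    object TouchOnStop {
      /** Set mtime to "now"; -1 leaves the access time unchanged. */
      def touch(fs: FileSystem, logPath: Path): Unit =
        fs.setTimes(logPath, System.currentTimeMillis(), -1L)
    }
    ```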
    
    If someone really does want to do incremental playback, I think what we 
have here would be the foundation for a very slick way of doing it: record the 
position in the history, and start from there. That `Option[Any]` could be 
filled with whatever static information is needed to start the replay, and on a 
poll the provider could just play the new data in, then return false to say 
"don't evict this SparkUI". That'd be pretty slick, as there'd be no background 
threads (which would cost O(incomplete-apps) and be expensive), just a load 
dependent on user interaction, O(users), a replay cost of 
O(delta-since-last-update), and whatever static memory gets used up, which is 
presumably some PSPACE thing, still O(incomplete-apps), but less than the cost 
of a thread in terms of both memory allocation and CPU load/swap hit, etc.
    
    I'm not going to do that, at least not here. What this patch can do is act 
as a starting point for it, though: on-demand incremental updates on incomplete 
apps (a rough sketch of the shape it might take is below).
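
    Very roughly, and purely as an illustration (the trait and method names 
below are invented, not this patch's API):

    ```scala
    // Illustrative sketch only: getAppUI returns the UI plus an opaque replay
    // token (the Option[Any] above); a later poll replays only the events
    // appended since that token and answers false for "don't evict this UI".
    trait IncrementalProvider[UI] {
      /** Build the UI and a token recording the replay position (e.g. a byte offset). */
      def getAppUI(appId: String): (UI, Option[Any])

      /**
       * Replay any events appended since `token` into `ui`; return the new
       * token plus `false`, i.e. "keep this cache entry, don't evict it".
       */
      def poll(appId: String, ui: UI, token: Option[Any]): (Option[Any], Boolean)
    }
    ```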
    
    
    If someone wanted to take that up, I promise I'll help review the code.

