Github user steveloughran commented on the pull request:

    https://github.com/apache/spark/pull/8744#issuecomment-151647359
  
    OK, this is a rework to make the concurrency logic cleaner/clearer; tested 
against Hadoop 2.6.x, 2.7.x and 2.8.0-SNAPSHOT.
    
    The thread which posts events to ATS now takes a case class of actions, the actions being "post an event" and "stop" (see the sketch after the list below).
    
    1. PostEntity: post the event; if this fails, push it back to the front of the queue and sleep for a while.
    1. StopQueue: bails out of the normal take/post (with sleep-on-failure) loop and enters the fast-shutdown sequence.
    1. The fast-shutdown sequence repeatedly retries posting, with a non-increasing delay between attempts.
    1. After the specified shutdown delay has passed, the thread exits anyway. 
    1. If the thread is interrupted, it exits.
    1. Time spent in the regular post/sleep cycle counts against the shutdown delay. That is, if the stop event is posted with 30s to go, and the current post+sleep takes 10s, there's only 20s left for the faster loop. Given that the shutdown time only matters if the back end is being slow/unresponsive, this is justifiable: it's all attempts to post data to ATS, after all.
    1. The shutdown routine (called from the `stop()` operation) generates a `SparkApplicationEnd` event if none was seen, and then queues a `StopQueue()` operation, after which the caller waits for the (configurable) shutdown time before interrupting the queue thread.
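    
    Roughly, the action types and the posting loop look like this. This is a non-authoritative sketch: only `PostEntity` and `StopQueue` come from the patch; the class, parameter and helper names (`PosterThreadSketch`, `tryPost`, the deque, the deadline) are illustrative.
    
    ```scala
    import java.util.concurrent.LinkedBlockingDeque
    import org.apache.hadoop.yarn.api.records.timeline.TimelineEntity
    
    sealed trait QueueAction
    case class PostEntity(entity: TimelineEntity) extends QueueAction
    case class StopQueue() extends QueueAction
    
    class PosterThreadSketch(
        queue: LinkedBlockingDeque[QueueAction],
        postRetryIntervalMs: Long,
        shutdownDeadlineMs: Long) extends Runnable {
    
      // hypothetical post operation: true on success, false on a retryable failure
      private def tryPost(entity: TimelineEntity): Boolean = ???
    
      override def run(): Unit = {
        var stopped = false
        // normal take + post loop; push back and sleep after a failed post
        while (!stopped) {
          queue.take() match {
            case PostEntity(entity) =>
              if (!tryPost(entity)) {
                queue.addFirst(PostEntity(entity))
                Thread.sleep(postRetryIntervalMs)
              }
            case StopQueue() =>
              stopped = true
          }
        }
        // fast-shutdown sequence: keep retrying the remaining events with a
        // short, non-increasing delay until the shutdown deadline passes
        var action = queue.poll()
        while (action != null && System.currentTimeMillis() < shutdownDeadlineMs) {
          action match {
            case PostEntity(entity) if !tryPost(entity) =>
              queue.addFirst(PostEntity(entity))
              Thread.sleep(100)
            case _ =>
          }
          action = queue.poll()
        }
      }
    }
    ```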
    
    The queue thread treats all interrupts as non-retrying exceptions: it doesn't swallow them. This is easier said than done, as the Hadoop 2.7 retry code in the `TimelineClientImpl` catches these and wraps them directly inside an `IOException`. Accordingly, when *any* `Exception` is caught from `TimelineClientImpl`, its cause is checked for being an `InterruptedException`, and, if it is, that interrupt is raised directly.
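    
    A minimal sketch of that unwrapping, assuming a wrapper around the timeline client call (the helper name is mine, not the patch's):
    
    ```scala
    // Rethrow an interrupt that Hadoop's retry code wrapped inside an
    // IOException; otherwise propagate the original exception unchanged.
    def unwrapInterrupt[T](operation: => T): T = {
      try {
        operation
      } catch {
        case e: Exception =>
          e.getCause match {
            case interrupt: InterruptedException => throw interrupt
            case _ => throw e
          }
      }
    }
    ```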
    
    I think the process is a lot cleaner now; that shutdown sequence is the main bit of complexity. Right now it won't treat interruptions (even nested ones) as retryable exceptions, and with the delays in the main post loop counting against shutdown time, any Hadoop 2.7+ retry logic happening in there will also count against the total shutdown delay.
    
    Also
    
    1. As suggested, moved the source code to `yarn/src/history`, with a `main` 
and `test` dir underneath. (#6d396c0)
    1. I've made all the little `AtomicLong` counters of things like events queued, posts failed, etc., Codahale `Counter` instances (which wrap an optimised-for-low-contention `AtomicLong`), and registered these on startup if the provided `SparkContext` has a metrics registry (#4bb6d6a). As a result, all these counters are automatically visible as metrics; anyone managing the Spark application can get statistics on event workload and failures. I've also added time-interval tracking of the ATS post operations, so you'll get statistics on how that service is performing.
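    
    For illustration, the counter/timer wiring could look something like this; the class and the metric names here are placeholders, not the ones in the patch:
    
    ```scala
    import com.codahale.metrics.{Counter, MetricRegistry, Timer}
    
    class HistoryMetricsSketch(registry: MetricRegistry) {
      // counters replacing the former ad-hoc AtomicLongs
      val eventsQueued: Counter = registry.counter("history.eventsQueued")
      val postFailures: Counter = registry.counter("history.postFailures")
      // timer giving rate + latency statistics for ATS post operations
      val postTimer: Timer = registry.timer("history.postOperations")
    }
    
    // Wrapping a post in the timer records the duration of each ATS PUT:
    //   val ctx = metrics.postTimer.time()
    //   try { /* post the entity */ } finally { ctx.stop() }
    ```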

