Github user steveloughran commented on the pull request:
https://github.com/apache/spark/pull/8744#issuecomment-151647359
OK, this is a rework to make the concurrency logic cleaner/clearer; tested
against Hadoop 2.6.x, 2.7.x and 2.8.0-SNAPSHOT.
The thread which posts events to ATS now takes case-class actions off its queue; the
actions are "post an event" and "stop" (see the sketch after this list).
1. PostEntity: post the event; if this fails, push it back to the front of the
queue and sleep for a while.
1. StopQueue: bails out of the normal take/post/sleep-on-failure loop and
enters the fast-shutdown sequence.
1. The fast shutdown repeatedly retries the post, with a non-increasing
delay.
1. After the specified shutdown delay has passed, the thread exits anyway.
1. If the thread is interrupted, it exits.
1. Time spent in the regular post/sleep cycle counts against the shutdown
delay. That is, if the stop event is posted with 30s to go, and the current
post+sleep takes 10s, there's only 20s left for the faster loop. Given that the
shutdown time only matters if the back end is being slow/unresponsive, this is
justifiable: it's all attempts to post data to ATS, after all.
1. The shutdown routine (called from the `stop()` operation) generates a
`SparkApplicationEnd` event if none was seen, then queues a `StopQueue()`
operation; the caller then waits for the (configurable) shutdown time before
interrupting the queue thread.
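To make the flow concrete, here's a minimal sketch of that action model. It assumes an absolute deadline carried in the stop action and a blocking deque; the names (`PosterThreadSketch`, `postToATS`, the retry-interval parameters) are illustrative, not the actual patch:

```scala
import java.util.concurrent.LinkedBlockingDeque

// Illustrative sketch of the queue/action model described above.
sealed trait QueueAction
case class PostEntity(entity: AnyRef) extends QueueAction
case class StopQueue(deadlineMillis: Long) extends QueueAction

class PosterThreadSketch(
    postRetryIntervalMillis: Long,
    shutdownRetryIntervalMillis: Long) extends Runnable {

  private val actions = new LinkedBlockingDeque[QueueAction]()

  def enqueue(action: QueueAction): Unit = actions.addLast(action)

  // Stand-in for the real "POST entity to ATS" call; assumed to throw on failure.
  private def postToATS(entity: AnyRef): Unit = ()

  override def run(): Unit = {
    try {
      var deadline = Long.MaxValue

      // Normal loop: take an action, post it, push it back + sleep on failure.
      // Sleeping here eats into any later shutdown deadline, because the
      // deadline is an absolute wall-clock time.
      var running = true
      while (running) {
        actions.takeFirst() match {
          case PostEntity(entity) =>
            try {
              postToATS(entity)
            } catch {
              case _: Exception =>
                actions.addFirst(PostEntity(entity)) // back to the front
                Thread.sleep(postRetryIntervalMillis)
            }
          case StopQueue(d) =>
            deadline = d
            running = false
        }
      }

      // Fast-shutdown loop: keep retrying with a fixed (non-increasing) delay
      // until the queue is drained or the deadline has passed.
      while (!actions.isEmpty && System.currentTimeMillis() < deadline) {
        actions.peekFirst() match {
          case PostEntity(entity) =>
            try {
              postToATS(entity)
              actions.removeFirst()
            } catch {
              case _: Exception => Thread.sleep(shutdownRetryIntervalMillis)
            }
          case _ =>
            actions.removeFirst()
        }
      }
    } catch {
      case _: InterruptedException => // interrupted: exit immediately
    }
  }
}
```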
The queue thread treats all interrupts as non-retrying exceptions: it
doesn't swallow them. This is easier said than done, as the Hadoop 2.7 retry
code in the `TimelineClientImpl` catches these and wraps them directly inside
an `IOException`. Accordingly, when *any* `Exception` is caught from
`TimelineClientImpl`, its cause is checked; if it is an `InterruptedException`,
that is raised directly.
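A sketch of that unwrapping, assuming the exception came out of the timeline client call; the helper and object names are hypothetical:

```scala
// Hypothetical helper: if an exception caught from the timeline client has an
// InterruptedException as its cause, rethrow the interrupt rather than
// treating the failure as retryable.
object InterruptUnwrapSketch {
  def rethrowIfWrappedInterrupt(e: Exception): Unit = e.getCause match {
    case interrupt: InterruptedException => throw interrupt
    case _ => // not an interrupt; the caller applies its normal retry policy
  }
}
```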
I think the process is a lot cleaner now; the shutdown sequence is the main
bit of complexity. Right now it won't treat (even nested) interruptions as
retryable exceptions, and with the delays in the main post loop counting against
shutdown time, any Hadoop 2.7+ retry logic happening in there will also count
against the total shutdown delay.
Also
1. As suggested, I've moved the source code to `yarn/src/history`, with `main`
and `test` dirs underneath. (#6d396c0)
1. I've made all the little `AtomicLong` counters of things like events
queued, posts failed, etc. Codahale `Counter` instances (which wrap an
optimised-for-low-contention counter), and registered these on startup if
the provided `SparkContext` has a metrics registry (#4bb6d6a). As a result, all
these counters are automatically visible as metrics; anyone managing the Spark
application can get statistics on event workload and failures. I've also
added time-interval tracking of the ATS post operations, so that you'll get
statistics on how that service is performing (see the sketch below).
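A rough sketch of that metrics wiring, assuming a `com.codahale.metrics.MetricRegistry` is reachable from the context's metrics system; the class and metric names here are made up for illustration:

```scala
import com.codahale.metrics.{Counter, MetricRegistry, Timer}

// Illustrative only: counters for queue/post activity plus a Timer around
// the ATS post, registered against a provided MetricRegistry.
class HistoryMetricsSketch(registry: MetricRegistry) {
  val eventsQueued: Counter = registry.counter("history.eventsQueued")
  val postFailures: Counter = registry.counter("history.postFailures")
  val postTimer: Timer = registry.timer("history.postTimer")

  /** Time one ATS post operation, counting failures as they happen. */
  def timedPost(post: => Unit): Unit = {
    val ctx = postTimer.time()
    try {
      post
    } catch {
      case e: Exception =>
        postFailures.inc()
        throw e
    } finally {
      ctx.stop()
    }
  }
}
```

With something like `timedPost` wrapped around the actual post call, both latency and failure counts show up in whatever sink the Spark metrics system reports to.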