----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/25614/ -----------------------------------------------------------
(Updated Sept. 15, 2014, 1:54 p.m.) Review request for mesos and Niklas Nielsen. Bugs: MESOS-1795 https://issues.apache.org/jira/browse/MESOS-1795 Repository: mesos-git Description ------- ## OVERVIEW This change adds a measure of safety for calls to `Future<T>::await()` without a timeout from within the JNI interface to the state abstraction. ## MOTIVATION Observed the following log output prior to a crash of the Marathon scheduler: ``` Sep 12 23:46:01 highly-available-457-540 marathon[11494]: F0912 23:46:01.771927 11532 org_apache_mesos_state_AbstractState.cpp:145] CHECK_READY(*future): is PENDING Sep 12 23:46:01 highly-available-457-540 marathon[11494]: *** Check failure stack trace: *** Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc2663a2d google::LogMessage::Fail() Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc26657e3 google::LogMessage::SendToLog() Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc2663648 google::LogMessage::Flush() Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc266603e google::LogMessageFatal::~LogMessageFatal() Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febc26588a3 Java_org_apache_mesos_state_AbstractState__1_1fetch_1get Sep 12 23:46:01 highly-available-457-540 marathon[11494]: @ 0x7febcd107d98 (unknown) ``` _Listing 1: Crash log output._ ## CHANGES IN DETAIL The in-line comments in `Latch::await(const Duration& duration)`, mention that it's possible for the call to return before the underlying process has completed for two reasons: ``` ... process::wait(pid, duration); // Explict to disambiguate. // It's possible that we failed to wait because: // (1) Our process has already terminated. // (2) We timed out (i.e., duration was not "infinite"). ... ``` _Listing 2: Excerpt from 3rdparty/src/libprocess/latch.cpp._ Call sites within `org_apache_mesos_state_AbstractState.cpp` that formerly discarded the return value now throw a `java.concurrent.ExecutionException` if the await fails. I chose to throw this instead of a `java.concurrent.TimeoutException` since the expectation from the caller's perspective is to block forever pending a ready state. Diffs ----- src/java/jni/org_apache_mesos_state_AbstractState.cpp 1accc8a Diff: https://reviews.apache.org/r/25614/diff/ Testing ------- - make - make check Thanks, Connor Doyle