-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/25614/
-----------------------------------------------------------

(Updated Sept. 15, 2014, 1:54 p.m.)


Review request for mesos and Niklas Nielsen.


Bugs: MESOS-1795
    https://issues.apache.org/jira/browse/MESOS-1795


Repository: mesos-git


Description
-------

## OVERVIEW

This change adds a measure of safety for calls to `Future<T>::await()`
without a timeout from within the JNI interface to the state
abstraction.

## MOTIVATION

Observed the following log output prior to a crash of the Marathon scheduler:

```
Sep 12 23:46:01 highly-available-457-540 marathon[11494]: F0912 23:46:01.771927 
11532 org_apache_mesos_state_AbstractState.cpp:145] CHECK_READY(*future): is 
PENDING 
Sep 12 23:46:01 highly-available-457-540 marathon[11494]: *** Check failure 
stack trace: ***
Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
0x7febc2663a2d  google::LogMessage::Fail()
Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
0x7febc26657e3  google::LogMessage::SendToLog()
Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
0x7febc2663648  google::LogMessage::Flush()
Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
0x7febc266603e  google::LogMessageFatal::~LogMessageFatal()
Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
0x7febc26588a3  Java_org_apache_mesos_state_AbstractState__1_1fetch_1get
Sep 12 23:46:01 highly-available-457-540 marathon[11494]:     @     
0x7febcd107d98  (unknown)
```
_Listing 1: Crash log output._

## CHANGES IN DETAIL

The in-line comments in `Latch::await(const Duration& duration)`, mention
that it's possible for the call to return before the underlying process
has completed for two reasons:

```
...
process::wait(pid, duration); // Explict to disambiguate.
// It's possible that we failed to wait because:
//   (1) Our process has already terminated.
//   (2) We timed out (i.e., duration was not "infinite").
...
```
_Listing 2: Excerpt from 3rdparty/src/libprocess/latch.cpp._

Call sites within `org_apache_mesos_state_AbstractState.cpp`
that formerly discarded the return value now throw a
`java.concurrent.ExecutionException` if the await fails.  I chose to throw
this instead of a `java.concurrent.TimeoutException` since the expectation
from the caller's perspective is to block forever pending a ready state.


Diffs
-----

  src/java/jni/org_apache_mesos_state_AbstractState.cpp 1accc8a 

Diff: https://reviews.apache.org/r/25614/diff/


Testing
-------

- make
- make check


Thanks,

Connor Doyle

Reply via email to