Hey Devs,
I got a tough nut to crack and my progress is severely slowed down by this
beast;
For testing the ExternalContainerizer, I had implemented a shim daemon that
wraps the MesosContainerizerProcess. That daemon appears to work fine, it is
the command exchange with that daemon that seems to be the problem.
The data exchange is done via a variant of process::Protocol. So it sends a
protobuffed request towards a given UPID and expects a protobuffed answer in
return. My variant additionally blocks until the returned future becomes
terminal. I need such blocking/synchronous behavior for implementing the EC/ECP
interface.
The problem manifests itself by an occasional SIGABORT within GLOG’s LogMessage
destructor’s call-path. This is happening in about 10% of all attempts and it
triggers only when enabling extended verbose logging (GLOG_v=5) on this
implementation. I was not able to replicate the problem on linux but it
persists on OSX. This abort happens while the test-containerizer is exiting
which is after having gone through one round of SyncProtocol. When adding a
forced delay (sleep) after the SyncProtocol invocation, things behave much
better, no sigaborts. So this obviously is a race, most likely introduced the
implementation or by my use of SyncProtocol and its resulting failure for the
ProcessManager to cleanup all connected threads in a graceful/proper way.
// Allows you to describe request/response protocols and then use
// those for sending requests and getting back responses. This variant
// of Protocol is working synchronously by blocking until the
// communication process has terminated.
template <typename Req, typename Res>
struct SyncProtocol
{
Try<Res> operator () (
const UPID& pid,
const Req& req) const
{
// Help debugging by adding some "type constraints".
{ Req* req = NULL; google::protobuf::Message* m = req; (void)m; }
{ Res* res = NULL; google::protobuf::Message* m = res; (void)m; }
ReqResProcess<Req, Res>* process = new ReqResProcess<Req, Res>(pid, req);
VLOG(2) << "ReqResProcess: " << (void *)process;
process::spawn(process);
Future<Res> future = dispatch(process, &ReqResProcess<Req, Res>::run);
future.await();
process::terminate(process);
process::wait(process);
delete process;
VLOG(2) << "Future is terminal";
if (!future.isReady()) {
return Error("Failed to receive a result");
}
return future.get();
}
};
The entire source for that test-containerizer can be found at:
https://github.com/tillt/mesos/blob/wip4/src/tests/test_containerizer.cpp
These are some typical last words before that crash:
I0526 17:30:15.987010 352854016 process.cpp:2661] Spawned process
__http__(1)@127.0.0.1:64799
I0526 17:30:15.987015 349097984 process.cpp:2778] Cleaning up
(2)@127.0.0.1:64799
I0526 17:30:15.987022 352317440 process.cpp:2671] Resuming
__http__(1)@127.0.0.1:64799 at 2014-05-26 15:30:15.987040000+00:00
I0526 17:30:15.987063 352854016 process.cpp:1037] Socket closed while receiving
I0526 17:30:15.987140 351780864 process.cpp:2671] Resuming
__http__(1)@127.0.0.1:64799 at 2014-05-26 15:30:15.987156992+00:00
I0526 17:30:15.987200 351780864 process.cpp:2778] Cleaning up
__http__(1)@127.0.0.1:64799
Abort trap: 6 (core dumped)
(lldb) thread backtrace
* thread #5: tid = 0x0004, 0x00007fff889aa866
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
* frame #0: 0x00007fff889aa866 libsystem_kernel.dylib`__pthread_kill + 10
frame #1: 0x00007fff8614935c libsystem_pthread.dylib`pthread_kill + 92
frame #2: 0x00007fff8c7f9b1a libsystem_c.dylib`abort + 125
frame #3: 0x000000011139c120
libmesos-0.19.0.dylib`glog_internal_namespace_::Mutex::Lock(this=0x00000001117952d0)
+ 64 at mutex.h:248
frame #4: 0x000000011139c0d3
libmesos-0.19.0.dylib`MutexLock(this=0x0000000114e77bd8, mu=0x00000001117952d0)
+ 35 at mutex.h:288
frame #5: 0x0000000111398efd
libmesos-0.19.0.dylib`MutexLock(this=0x0000000114e77bd8, mu=0x00000001117952d0)
+ 29 at mutex.h:288
frame #6: 0x000000011138a097
libmesos-0.19.0.dylib`google::LogMessage::Flush(this=0x0000000114e77e48) + 279
at logging.cc:1280
frame #7: 0x0000000111389f38
libmesos-0.19.0.dylib`~LogMessage(this=0x0000000114e77e48) + 24 at
logging.cc:1240
frame #8: 0x0000000111389f15
libmesos-0.19.0.dylib`~LogMessage(this=0x0000000114e77e48) + 21 at
logging.cc:1239
frame #9: 0x00000001110fae9a
libmesos-0.19.0.dylib`process::ProcessManager::resume(this=0x00007fb859500a40,
process=0x00007fb8594143a8) + 474 at process.cpp:2671
frame #10: 0x00000001110fab5e
libmesos-0.19.0.dylib`process::schedule(arg=0x0000000000000000) + 142 at
process.cpp:1380
frame #11: 0x00007fff86148899 libsystem_pthread.dylib`_pthread_body + 138
frame #12: 0x00007fff8614872a libsystem_pthread.dylib`_pthread_start + 137
(lldb) frame select 3
frame #3: 0x000000011139c120
libmesos-0.19.0.dylib`glog_internal_namespace_::Mutex::Lock(this=0x00000001117952d0)
+ 64 at mutex.h:248
245 if (is_safe_ && pthread_rwlock_init(&mutex_, NULL) != 0) abort();
246 }
247 Mutex::~Mutex() { SAFE_PTHREAD(pthread_rwlock_destroy); }
-> 248 void Mutex::Lock() { SAFE_PTHREAD(pthread_rwlock_wrlock); }
249 void Mutex::Unlock() { SAFE_PTHREAD(pthread_rwlock_unlock); }
250 #ifdef GMUTEX_TRYLOCK
251 bool Mutex::TryLock() { return is_safe_ ?
The mutex that is tried to get locked at this point has already gotten
destroyed, hence a SIGABRT trigger as pthread_rwlock_wrlock returns 22 (Invalid
Argument):
#define SAFE_PTHREAD(fncall) do { /* run fncall if is_safe_ is true */ \
if (is_safe_ && fncall(&mutex_) != 0) abort(); \
} while (0)
What is it that could cause this GLOG behavior?
Thanks for any hint!
Till