Hey Devs,

I got a tough nut to crack and my progress is severely slowed down by this 
beast;

For testing the ExternalContainerizer, I had implemented a shim daemon that 
wraps the MesosContainerizerProcess. That daemon appears to work fine, it is 
the command exchange with that daemon that seems to be the problem. 

The data exchange is done via a variant of process::Protocol. So it sends a 
protobuffed request towards a given UPID and expects a protobuffed answer in 
return. My variant additionally blocks until the returned future becomes 
terminal. I need such blocking/synchronous behavior for implementing the EC/ECP 
interface. 

The problem manifests itself by an occasional SIGABORT within GLOG’s LogMessage 
destructor’s call-path. This is happening in about 10% of all attempts and it 
triggers only when enabling extended verbose logging (GLOG_v=5) on this 
implementation. I was not able to replicate the problem on linux but it 
persists on OSX. This abort happens while the test-containerizer is exiting 
which is after having gone through one round of SyncProtocol. When adding a 
forced delay (sleep) after the SyncProtocol invocation, things behave much 
better, no sigaborts. So this obviously is a race, most likely introduced the 
implementation or by my use of SyncProtocol and its resulting failure for the 
ProcessManager to cleanup all connected threads in a graceful/proper way. 


// Allows you to describe request/response protocols and then use
// those for sending requests and getting back responses. This variant
// of Protocol is working synchronously by blocking until the
// communication process has terminated.
template <typename Req, typename Res>
struct SyncProtocol
{
 Try<Res> operator () (
     const UPID& pid,
     const Req& req) const
 {
   // Help debugging by adding some "type constraints".
   { Req* req = NULL; google::protobuf::Message* m = req; (void)m; }
   { Res* res = NULL; google::protobuf::Message* m = res; (void)m; }

   ReqResProcess<Req, Res>* process = new ReqResProcess<Req, Res>(pid, req);

   VLOG(2) << "ReqResProcess: " << (void *)process;

   process::spawn(process);
   Future<Res> future = dispatch(process, &ReqResProcess<Req, Res>::run);

   future.await();

   process::terminate(process);
   process::wait(process);
   delete process;

   VLOG(2) << "Future is terminal";

   if (!future.isReady()) {
     return Error("Failed to receive a result");
   }

   return future.get();
 }
};

The entire source for that test-containerizer can be found at: 
https://github.com/tillt/mesos/blob/wip4/src/tests/test_containerizer.cpp

These are some typical last words before that crash:
I0526 17:30:15.987010 352854016 process.cpp:2661] Spawned process 
__http__(1)@127.0.0.1:64799
I0526 17:30:15.987015 349097984 process.cpp:2778] Cleaning up 
(2)@127.0.0.1:64799
I0526 17:30:15.987022 352317440 process.cpp:2671] Resuming 
__http__(1)@127.0.0.1:64799 at 2014-05-26 15:30:15.987040000+00:00
I0526 17:30:15.987063 352854016 process.cpp:1037] Socket closed while receiving
I0526 17:30:15.987140 351780864 process.cpp:2671] Resuming 
__http__(1)@127.0.0.1:64799 at 2014-05-26 15:30:15.987156992+00:00
I0526 17:30:15.987200 351780864 process.cpp:2778] Cleaning up 
__http__(1)@127.0.0.1:64799
Abort trap: 6 (core dumped)

(lldb) thread backtrace
* thread #5: tid = 0x0004, 0x00007fff889aa866 
libsystem_kernel.dylib`__pthread_kill + 10, stop reason = signal SIGSTOP
  * frame #0: 0x00007fff889aa866 libsystem_kernel.dylib`__pthread_kill + 10
    frame #1: 0x00007fff8614935c libsystem_pthread.dylib`pthread_kill + 92
    frame #2: 0x00007fff8c7f9b1a libsystem_c.dylib`abort + 125
    frame #3: 0x000000011139c120 
libmesos-0.19.0.dylib`glog_internal_namespace_::Mutex::Lock(this=0x00000001117952d0)
 + 64 at mutex.h:248
    frame #4: 0x000000011139c0d3 
libmesos-0.19.0.dylib`MutexLock(this=0x0000000114e77bd8, mu=0x00000001117952d0) 
+ 35 at mutex.h:288
    frame #5: 0x0000000111398efd 
libmesos-0.19.0.dylib`MutexLock(this=0x0000000114e77bd8, mu=0x00000001117952d0) 
+ 29 at mutex.h:288
    frame #6: 0x000000011138a097 
libmesos-0.19.0.dylib`google::LogMessage::Flush(this=0x0000000114e77e48) + 279 
at logging.cc:1280
    frame #7: 0x0000000111389f38 
libmesos-0.19.0.dylib`~LogMessage(this=0x0000000114e77e48) + 24 at 
logging.cc:1240
    frame #8: 0x0000000111389f15 
libmesos-0.19.0.dylib`~LogMessage(this=0x0000000114e77e48) + 21 at 
logging.cc:1239
    frame #9: 0x00000001110fae9a 
libmesos-0.19.0.dylib`process::ProcessManager::resume(this=0x00007fb859500a40, 
process=0x00007fb8594143a8) + 474 at process.cpp:2671
    frame #10: 0x00000001110fab5e 
libmesos-0.19.0.dylib`process::schedule(arg=0x0000000000000000) + 142 at 
process.cpp:1380
    frame #11: 0x00007fff86148899 libsystem_pthread.dylib`_pthread_body + 138
    frame #12: 0x00007fff8614872a libsystem_pthread.dylib`_pthread_start + 137

 (lldb) frame select 3
frame #3: 0x000000011139c120 
libmesos-0.19.0.dylib`glog_internal_namespace_::Mutex::Lock(this=0x00000001117952d0)
 + 64 at mutex.h:248
   245    if (is_safe_ && pthread_rwlock_init(&mutex_, NULL) != 0) abort();
   246  }
   247  Mutex::~Mutex()            { SAFE_PTHREAD(pthread_rwlock_destroy); }
-> 248  void Mutex::Lock()         { SAFE_PTHREAD(pthread_rwlock_wrlock); }
   249  void Mutex::Unlock()       { SAFE_PTHREAD(pthread_rwlock_unlock); }
   250  #ifdef GMUTEX_TRYLOCK
   251  bool Mutex::TryLock()      { return is_safe_ ?

The mutex that is tried to get locked at this point has already gotten 
destroyed, hence a SIGABRT trigger as pthread_rwlock_wrlock returns 22 (Invalid 
Argument):

#define SAFE_PTHREAD(fncall)  do {   /* run fncall if is_safe_ is true */  \
  if (is_safe_ && fncall(&mutex_) != 0) abort();                           \
} while (0)


What is it that could cause this GLOG behavior?

Thanks for any hint!
Till

Reply via email to