I've been playing around with some examples in contrib/python on the latest
master.

This led me to discover the minor issue fixed in 510c026.

However, as I pressed further, I found that any checkpoints I tried to
initiate from within Python were not finishing (the function call never
returned).  Through the wonders of git-bisect, I ended up squarely on
823d096.

Upon inspection, I found that this commit removes the following function:

-void dmtcp::userHookTrampoline_postCkpt(bool isRestart)
-{
-  //this function runs before other threads are resumed
-  if(isRestart){
-    numRestarts++;
-    if(userHookPostRestart != NULL)
-      (*userHookPostRestart)();
-  }else{
-    numCheckpoints++;
-    if(userHookPostCheckpoint != NULL)
-      (*userHookPostCheckpoint)();
-  }
-}

The removal of userHookTrampoline_postCkpt also removes calls which
increment the static variables numRestarts and numCheckpoints.

These are used later in dmtcpplugin.cpp @ 113:

  if(dmtcpRunCommand('c')){ //request checkpoint
    //and wait for the checkpoint
    while(oldNumRestarts==numRestarts && oldNumCheckpoints==numCheckpoints){
      //nanosleep should get interrupted by checkpointing with an EINTR error
      //though there is a race to get to nanosleep() before the checkpoint
      struct timespec t = {1,0};
      nanosleep(&t, NULL);
      memfence();  //make sure the loop condition doesn't get optimized
    }
    rv = (oldNumRestarts==numRestarts ? DMTCP_AFTER_CHECKPOINT : DMTCP_AFTER_RESTART);
  }

The logic here suggests that if either numRestarts or numCheckpoints is not
incremented, you will be stuck in the while loop.  Having grep'd the code,
I can't find any other place where the incrementing happens.  I also recall
from running strace that my checkpoint was stuck making nanosleep calls.
So, everything seems to confirm my suspicions.
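
To illustrate the reasoning, here is a minimal single-threaded sketch of the
wait loop. It is not DMTCP code: waitLoopExits, hookIncrements, and maxPolls
are hypothetical names, the "checkpoint completes on the third poll" step is
made up, and the poll cap exists only so the demo terminates (the real loop
has no such cap).

```cpp
// Models the wait loop from dmtcpplugin.cpp with local counters.
// hookIncrements == true  : a post-checkpoint hook bumps numCheckpoints
//                           (the behavior before 823d096).
// hookIncrements == false : nothing bumps the counters
//                           (the behavior I believe we have after 823d096).
// Returns true if the loop condition ever becomes false, i.e. the wait ends.
bool waitLoopExits(bool hookIncrements, int maxPolls)
{
  int numRestarts = 0;
  int numCheckpoints = 0;
  const int oldNumRestarts = numRestarts;
  const int oldNumCheckpoints = numCheckpoints;

  for (int poll = 0; poll < maxPolls; ++poll) {
    // Same condition as the real while loop, negated to detect the exit.
    if (!(oldNumRestarts == numRestarts &&
          oldNumCheckpoints == numCheckpoints)) {
      return true;
    }
    // Assumption for the demo: the checkpoint completes on the third poll.
    if (poll == 2 && hookIncrements) {
      numCheckpoints++;
    }
  }
  return false;  // would spin forever in the real, uncapped loop
}
```

With hookIncrements set to true the function returns true; with it set to
false the poll cap is exhausted and it returns false, which matches the
endless nanosleep calls I saw in strace.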

I'm not sure what the proper solution is.  As a stopgap, I added a quick
snippet to dmtcpplugin.cpp:

void dmtcp::increment_counters(bool isRestart)
{
  if (isRestart)
  {
    numRestarts++;
  }
  else
  {
    numCheckpoints++;
  }
}

and called it from mtcpinterface.cpp:

  DmtcpWorker::waitForStage4Resume(isRestart);

  WorkerState::setCurrentState( WorkerState::RUNNING );

  increment_counters(isRestart);

  if (dmtcp_is_ptracing == NULL || !dmtcp_is_ptracing()) {
          // Inform Coordinator of our RUNNING state;
          // If running under ptrace, lets do this in sleep-between-ckpt callback
          DmtcpWorker::informCoordinatorOfRUNNINGState();
  }

This did the trick, but I'm not sure it is the cleanest solution, or even
what the authors intended.

I would appreciate it if someone could confirm that this is an issue, or
otherwise let me know if I'm in error.

I'm happy to help implement a fix, but I may need to study the code more to
come up with something more appropriate than the above hack.