I've been playing around with some examples in contrib/python on the latest master.
This led me to discover the minor issue fixed in 510c026. However, as I pressed further, I found that any checkpoints I tried to initiate from within Python were not finishing (the function call never returned). Through the wonders of git-bisect, I ended up squarely on 823d096. Upon inspection, I discovered:

    -void dmtcp::userHookTrampoline_postCkpt(bool isRestart)
    -{
    -  //this function runs before other threads are resumed
    -  if(isRestart){
    -    numRestarts++;
    -    if(userHookPostRestart != NULL)
    -      (*userHookPostRestart)();
    -  }else{
    -    numCheckpoints++;
    -    if(userHookPostCheckpoint != NULL)
    -      (*userHookPostCheckpoint)();
    -  }
    -}

The removal of userHookTrampoline_postCkpt also removes the only calls that increment the static variables numRestarts and numCheckpoints. These are used later in dmtcpplugin.cpp @ 113:

    if(dmtcpRunCommand('c')){ //request checkpoint
      //and wait for the checkpoint
      while(oldNumRestarts==numRestarts && oldNumCheckpoints==numCheckpoints){
        //nanosleep should get interrupted by checkpointing with an EINTR error
        //though there is a race to get to nanosleep() before the checkpoint
        struct timespec t = {1,0};
        nanosleep(&t, NULL);
        memfence(); //make sure the loop condition doesn't get optimized
      }
      rv = (oldNumRestarts==numRestarts ? DMTCP_AFTER_CHECKPOINT : DMTCP_AFTER_RESTART);
    }

The loop only exits once numRestarts or numCheckpoints has changed, so if neither is ever incremented, the caller is stuck in the while loop forever. Having grep'd the code, I can't find any other place where the incrementing happens. I also recall from running strace that my checkpoint was stuck making repeated nanosleep calls. So everything seems to confirm my suspicions. I'm not sure what the proper solution is.
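For anyone who wants to see the hang in isolation, here is a minimal sketch of the same wait condition outside of DMTCP. Everything in it is a stand-in I made up for illustration: the counters, the WaitResult enum, and waitForCheckpoint() are hypothetical and not part of the DMTCP sources. The only change from the real loop is that it gives up after a fixed number of polls instead of spinning forever.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-ins for the static counters in dmtcpplugin.cpp.
static std::atomic<int> numCheckpoints{0};
static std::atomic<int> numRestarts{0};

// Hypothetical result codes; TIMED_OUT does not exist in the real code,
// where the loop would simply never return.
enum WaitResult { AFTER_CHECKPOINT, AFTER_RESTART, TIMED_OUT };

// Polls with the same exit condition as the loop quoted above, but stops
// after maxPolls iterations so the stuck case can be observed.
WaitResult waitForCheckpoint(int maxPolls) {
  assert(maxPolls > 0);
  const int oldNumRestarts = numRestarts.load();
  const int oldNumCheckpoints = numCheckpoints.load();
  for (int i = 0; i < maxPolls; ++i) {
    if (oldNumRestarts != numRestarts.load()) return AFTER_RESTART;
    if (oldNumCheckpoints != numCheckpoints.load()) return AFTER_CHECKPOINT;
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  return TIMED_OUT;  // nobody incremented either counter
}
```

With nothing incrementing the counters, waitForCheckpoint() always runs out of polls, which matches the endless nanosleep calls I saw in strace.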
I added a quick snippet to dmtcpplugin.cpp:

    void dmtcp::increment_counters(bool isRestart)
    {
      if (isRestart) {
        numRestarts++;
      } else {
        numCheckpoints++;
      }
    }

and called it from mtcpinterface.cpp:

    DmtcpWorker::waitForStage4Resume(isRestart);
    WorkerState::setCurrentState( WorkerState::RUNNING );
    increment_counters(isRestart);
    if (dmtcp_is_ptracing == NULL || !dmtcp_is_ptracing()) {
      // Inform Coordinator of our RUNNING state;
      // If running under ptrace, lets do this in sleep-between-ckpt callback
      DmtcpWorker::informCoordinatorOfRUNNINGState();
    }

This did the trick, but I'm not sure it is the cleanest solution, or even what the authors intended. I would appreciate it if someone could confirm that this is an issue, or otherwise let me know if I'm in error. I'm happy to help implement a fix, but I may need to study the code more to come up with something more appropriate than the above hack.
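To show why restoring the increment is enough to unblock the caller, here is a rough standalone sketch. It is not DMTCP code: requestAndWait() and the WaitResult enum are hypothetical, the helper thread only plays the role of whatever runs after the checkpoint completes, and I used std::atomic where the real code uses plain statics plus memfence().

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-ins for the static counters in dmtcpplugin.cpp.
static std::atomic<int> numCheckpoints{0};
static std::atomic<int> numRestarts{0};

// Same shape as the quick fix above: bump exactly one counter.
void increment_counters(bool isRestart) {
  if (isRestart) {
    numRestarts++;
  } else {
    numCheckpoints++;
  }
}

// Hypothetical result codes for this sketch only.
enum WaitResult { AFTER_CHECKPOINT, AFTER_RESTART };

// Snapshots the counters, lets a helper thread simulate the post-checkpoint
// (or post-restart) path, then waits with the same condition as the real loop.
WaitResult requestAndWait(bool isRestart) {
  const int oldNumRestarts = numRestarts.load();
  const int oldNumCheckpoints = numCheckpoints.load();

  std::thread ckptThread([isRestart] {
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    increment_counters(isRestart);  // without this call the loop never exits
  });

  while (oldNumRestarts == numRestarts.load() &&
         oldNumCheckpoints == numCheckpoints.load()) {
    std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }
  ckptThread.join();

  // At least one counter must have moved for the loop to have exited.
  assert(oldNumRestarts != numRestarts.load() ||
         oldNumCheckpoints != numCheckpoints.load());
  return oldNumRestarts == numRestarts.load() ? AFTER_CHECKPOINT
                                              : AFTER_RESTART;
}
```

Calling requestAndWait(false) returns once the helper thread bumps numCheckpoints, and requestAndWait(true) returns once it bumps numRestarts. Where the increment belongs in the real code is still the open question.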