Hi All,
We're currently using BLCR to checkpoint/restart (c/r) jobs on our academic
cluster. After some work, it works great for the things that it can do, but
Berkeley Lab recently lost the funding to support it, and it will no longer
run with new kernels.
So we're looking for a new c/r infrastructure, and a user from NYU recommended
DMTCP. I first tried the version in the Ubuntu repository; it would checkpoint
OK but would refuse to restart.
I then downloaded the latest stable code, which compiled perfectly, as far as
I could see, and installed it. It passed all the tests but one:
================================
== Tests ==
epoll1 FAILED root-pids: [5791] msg: user program startup error, 2 expected, 1 found, running=1 ..
== Summary == stunted: 58 of 59 tests passed
================================
The new build had a similar problem: it would checkpoint but not restart.
However, this was tested on my laptop, and when I moved from home to my office
(causing a network reset) and tried to restart the job one last time, the job
DID restart, but it lost the redirection of STDOUT, dumped a lot of output,
and finally segfaulted (which the job does not do when run outside DMTCP).
I tried it twice with the same result.
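For anyone trying to reproduce the restart failure, something like the following could capture the restarted job's output and exit status in one place. It is only a sketch: `run_logged` is a hypothetical helper name, the log path is arbitrary, and the only DMTCP piece it assumes is the `dmtcp_restart_script.sh` link that DMTCP leaves in the checkpoint directory.

```shell
#!/bin/sh
# run_logged LOGFILE CMD [ARGS...]
# Hypothetical helper: run CMD with stdout and stderr appended to LOGFILE,
# record CMD's exit status in the log, and return that same status.
# (Under sh/bash a status of 139 usually means the process died on SIGSEGV.)
run_logged() {
    log=$1
    shift
    "$@" >>"$log" 2>&1
    status=$?
    echo "exit status: $status" >>"$log"
    return "$status"
}

# Example (not run here), using the checkpoint directory from this post:
# run_logged /tmp/dmtcp/restart.log /tmp/dmtcp/dmtcp_restart_script.sh
```

Appending to the log rather than truncating it keeps the output of repeated restart attempts together, which helps when comparing runs.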
With the current version (latest stable, 2.5.2), still on my laptop:
$ uname -a
This is the sequence of events:
In term1:
$ ps aux | grep -i dctc[p]
(nothing)
$ export DMTCP_CHECKPOINT_INTERVAL=30
$ export DMTCP_PORT=8889
$ export DMTCP_GZIP=0
$ export DMTCP_CHECKPOINT_DIR=/tmp/dmtcp
$ dmtcp_coordinator
$ cd
$ export DMTCP_CHECKPOINT_INTERVAL=30
$ export DMTCP_PORT=8889
$ export DMTCP_GZIP=0
$ export DMTCP_CHECKPOINT_DIR=/tmp/dmtcp
$ ps aux | grep dmtcp
hjm 547 0.0 0.0 26376 3932 pts/1 S+ 15:07 0:00 dmtcp_coordinator
$ sleep 1
$ rm -f /tmp/dmtcp/*
$ dmtcp_launch tacg -n6 -slLc -S -F2 < chr1.fa > junk/jj &
$ tacgid=$!
$ echo "captured PID = $tacgid"
$ sleep 5
(stderr starts...)
$ dmtcp_command -c
$ sleep 5
$ ls -l /tmp/dmtcp
dmtcp_restart_script.sh -> dmtcp_restart_script_21af429c2e5d9c5-400
-rwxrw-r-- 1 hjm hjm 12416 May 17 15:09 dmtcp_restart_script_21af429c2e5d9c5-40000-65332637bcc1.sh
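The fixed `sleep 5` after `dmtcp_command -c` only guesses at when the checkpoint has finished. A minimal sketch of a polling alternative is below, assuming DMTCP's usual `ckpt_*.dmtcp` naming for checkpoint images; `wait_for_ckpt` is a hypothetical helper name, not part of DMTCP.

```shell
#!/bin/sh
# wait_for_ckpt DIR [TIMEOUT_SECONDS]
# Hypothetical helper: poll DIR once per second until a file matching
# ckpt_*.dmtcp exists, printing the first match. Gives up after
# TIMEOUT_SECONDS (default 30). Returns 0 on success, 1 on timeout.
wait_for_ckpt() {
    dir=$1
    timeout=${2:-30}
    elapsed=0
    while [ "$elapsed" -le "$timeout" ]; do
        for f in "$dir"/ckpt_*.dmtcp; do
            # If the glob matched nothing, $f is the literal pattern
            # and the -e test fails.
            if [ -e "$f" ]; then
                echo "$f"
                return 0
            fi
        done
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Example (not run here), in place of the fixed sleep:
# dmtcp_command -c && wait_for_ckpt "$DMTCP_CHECKPOINT_DIR" 30
```

A timeout from this helper would also show directly whether any checkpoint image was written at all, as opposed to only the restart script.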