Hi All,

We're currently using BLCR to c/r jobs on our academic cluster and, after some
work, it works great for the things that it can do. Recently, however, the BL
lost funding to support it, and it will no longer run with new kernels.

So we're looking for a new c/r infrastructure, and a user from NYU recommended
DMTCP. I first tried the version in the Ubuntu repository; it would checkpoint
OK, but would refuse to restart.

I then downloaded the latest stable code, which compiled perfectly as far as I
could see, and installed it. It passed all the tests but one:
================================
== Tests == 


epoll1         FAILED               root-pids: [5791] msg: user program startup error, 2 expected, 1 found, running=1 ..

== Summary == stunted: 58 of 59 tests passed

================================


The new build had a similar problem: it would checkpoint, but not restart.

However, this was tested on my laptop. When I moved from home to my office
(causing a network reset) and tried to restart the job one last time, the job
DID restart, but it lost the redirection to STDOUT, dumped a lot of output, and
finally segfaulted (which the job does not do when run without DMTCP). I tried
it twice with the same result.

With the current version (latest stable, version 2.5.2), still on my laptop:
$ uname -a

This is the sequence of events:

In term1:

$ ps aux | grep -i dctc[p]
(nothing)

$ export DMTCP_CHECKPOINT_INTERVAL=30
$ export DMTCP_PORT=8889
$ export DMTCP_GZIP=0
$ export DMTCP_CHECKPOINT_DIR=/tmp/dmtcp
$ dmtcp_coordinator


$ cd
$ export DMTCP_CHECKPOINT_INTERVAL=30
$ export DMTCP_PORT=8889
$ export DMTCP_GZIP=0
$ export DMTCP_CHECKPOINT_DIR=/tmp/dmtcp
$ ps aux | grep dmtcp
hjm        547  0.0  0.0  26376  3932 pts/1    S+   15:07   0:00 dmtcp_coordinator
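
Since the same four variables have to be exported by hand in each terminal, one way to keep them identical is to put them in a small function that every terminal sources. This is only a sketch: the function name dmtcp_env and the optional directory argument are hypothetical, and the values are simply the ones exported above.

```shell
#!/bin/sh
# Hypothetical helper (not part of DMTCP): export the settings used in this
# post so that every terminal sees identical values.
dmtcp_env() {
    export DMTCP_CHECKPOINT_INTERVAL=30             # value exported in this post
    export DMTCP_PORT=8889                          # value exported in this post
    export DMTCP_GZIP=0                             # value exported in this post
    export DMTCP_CHECKPOINT_DIR="${1:-/tmp/dmtcp}"  # defaults to the post's directory
    mkdir -p "$DMTCP_CHECKPOINT_DIR"                # make sure the directory exists
}
```

After sourcing the file, calling dmtcp_env with no argument reproduces the exports above; passing a path only changes the checkpoint directory.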

$ sleep 1
$ rm -f /tmp/dmtcp/*
$ dmtcp_launch tacg -n6 -slLc -S -F2 < chr1.fa > junk/jj &
$ tacgid=$!
$ echo "captured PID = $tacgid"
$ sleep 5
(stderr starts...)

$ dmtcp_command -c
$ sleep 5
$ ls -l /tmp/dmtcp
dmtcp_restart_script.sh -> dmtcp_restart_script_21af429c2e5d9c5-400
-rwxrw-r-- 1 hjm hjm     12416 May 17 15:09 dmtcp_restart_script_21af429c2e5d9c5-40000-65332637bcc1.sh
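
For anyone trying to reproduce this, a quick look at what the checkpoint directory holds may help narrow things down before a restart is attempted. The following is a minimal sketch, not a DMTCP tool: the function name check_ckpt_dir is hypothetical, and it assumes that checkpoint images are named ckpt_*.dmtcp and that dmtcp_restart_script.sh is a symlink, as in the listing above.

```shell
#!/bin/sh
# Hypothetical helper: summarize a checkpoint directory before restarting.
# Assumption: images are named ckpt_*.dmtcp; dmtcp_restart_script.sh is a symlink.
check_ckpt_dir() {
    dir="${1:-/tmp/dmtcp}"
    script="$dir/dmtcp_restart_script.sh"

    # Count checkpoint images; an unmatched glob stays literal and fails -e.
    n=0
    for f in "$dir"/ckpt_*.dmtcp; do
        if [ -e "$f" ]; then
            n=$((n + 1))
        fi
    done
    echo "images: $n"

    # A dangling symlink passes -L but fails -e.
    if [ -L "$script" ] && [ ! -e "$script" ]; then
        echo "restart script: dangling symlink"
        return 1
    elif [ ! -e "$script" ]; then
        echo "restart script: missing"
        return 1
    fi
    echo "restart script: ok"

    # Succeed only if at least one image was found.
    [ "$n" -gt 0 ]
}
```

Running check_ckpt_dir /tmp/dmtcp right after dmtcp_command -c would show whether an image was written at all and whether the restart-script symlink resolves.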
_______________________________________________
Dmtcp-forum mailing list
Dmtcp-forum@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dmtcp-forum
