I found a few more bugs after testing the C/R functionality this morning. I just committed some more C/R fixes in r24306 (things are now working correctly on my test cluster).
  https://svn.open-mpi.org/trac/ompi/changeset/24306

One thing I just noticed in your original email is that you are specifying the wrong parameter for migration (it differs from the one used for the standard C/R functionality, for backwards compatibility reasons). You need to use the 'ft-enable-cr-recovery' AMCA parameter:
  mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10

If you still get the segmentation fault after upgrading to the current trunk, can you send me a backtrace from the core file? That will help me narrow down the problem.

Thanks,
Josh

On Jan 26, 2011, at 8:40 AM, Hugo Meyer wrote:

> Josh.
>
> The ompi-checkpoint with its restart is now working great, but the same
> error persists with ompi-migrate. I've also tried using "-r", but I get the
> same error.
>
> Best regards.
>
> Hugo Meyer
>
> 2011/1/26 Hugo Meyer <meyer.h...@gmail.com>
> Thanks Josh.
>
> I've already checked prelink and it is set to "no".
>
> I'm going to try with the trunk head, and then I'll let you know how it goes.
>
> Best regards.
>
> Hugo Meyer
>
> 2011/1/25 Joshua Hursey <jjhur...@open-mpi.org>
>
> Can you try with the current trunk head (r24296)?
> I just committed a fix for the C/R functionality in which restarts were
> getting stuck. This will likely affect the migration functionality, but I
> have not had an opportunity to test just yet.
>
> Another thing to check is that prelink is turned off on all of your machines.
> https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
>
> Let me know if the problem persists, and I'll dig into it a bit more.
>
> Thanks,
> Josh
>
> On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
>
> > Hello @ll
> >
> > I've got a problem when I try to use the ompi-migrate command.
> >
> > What I'm doing is running, for example, the following application on one
> > node of a cluster (both processes will run on the same node):
> >
> > mpirun -np 2 -am ft-enable-cr ./whoami 10 10
> >
> > Then on the same node I try to migrate the processes to another node:
> >
> > ompi-migrate -x node9 -t node3 14914
> >
> > And then I get this message:
> >
> > [clus9:15620] *** Process received signal ***
> > [clus9:15620] Signal: Segmentation fault (11)
> > [clus9:15620] Signal code: Address not mapped (1)
> > [clus9:15620] Failing at address: (nil)
> > [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b8d40]
> > [clus9:15620] *** End of error message ***
> > Segmentation fault
> >
> > I assume that maybe there is something wrong with the thread level, but I
> > have configured Open MPI like this:
> >
> > ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/
> > --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr
> > --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread
> > --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/
> > --with-blcr-libdir=/soft/blcr-0.8.2/lib/
> >
> > Checkpoint and restart work fine, but when I restore an application
> > that has more than one process, it is restored and runs up to the
> > last line before MPI_FINALIZE(), but the processes never finalize. I assume
> > that they never call MPI_FINALIZE(). With one process,
> > ompi-checkpoint and ompi-restart work great.
> >
> > Best regards.
> >
> > Hugo Meyer
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey
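
For the backtrace requested at the top of this message, a minimal sketch follows. It assumes gdb is installed, that the binary that crashed is ompi-migrate, and that the core file is named core.15620 (the PID from the log in the original report). All three are placeholders rather than facts from this thread: the actual core file name depends on the system's core_pattern setting, and the crashing binary may turn out to be a different one.

```shell
# Allow core files in this shell before reproducing the crash.
ulimit -c unlimited 2>/dev/null || echo "could not raise the core file limit"

# Placeholders (assumptions): the binary that crashed and the core it left behind.
BIN="$(command -v ompi-migrate || echo ompi-migrate)"
CORE="core.15620"

if [ -f "$CORE" ]; then
    # Dump the stack of every thread into a text file that can be mailed to the list.
    gdb -batch -ex "thread apply all bt" "$BIN" "$CORE" > backtrace.txt 2>&1
    echo "backtrace written to backtrace.txt"
else
    echo "no core file found at $CORE"
fi
```

Since the build in the original report was configured with --enable-debug, the resulting backtrace should include function names and line numbers.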