Thank you, Joshua. I will try the procedure with these modifications and I will let you know how it goes.
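Just to be explicit, this is roughly the sequence I plan to run (a rough sketch only: the checkout path is an assumption, and the build may need a fresh configure; the install prefix and configure options are the ones quoted at the bottom of this thread):

    cd ~/desarrollo/ompi-code/ompi-trunk    # assumed location of my svn checkout of the trunk
    svn update -r 24317                     # pick up the fix Josh mentions below
    make all install                        # reinstalls into /home/hmeyer/desarrollo/ompi-code/binarios/ (re-run configure first if needed)

    # then repeat the failing test case
    /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10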
Best Regards.

Hugo Meyer


2011/1/27 Joshua Hursey <jjhur...@open-mpi.org>

> I believe that this is now fixed on the trunk. All the details are in the commit message:
>   https://svn.open-mpi.org/trac/ompi/changeset/24317
>
> In my testing yesterday, I did not test the scenario where the node with mpirun also contains processes (the test cluster I was using does not run this way by default), so I was able to reproduce the problem by running on a single node. A couple of bugs emerged that are fixed in the commit. The two bugs that were hurting you were the TCP socket cleanup (which caused the looping of the automatic recovery) and the incorrect accounting of local process termination (which caused the modex errors).
>
> Let me know if that fixes the problems that you were seeing.
>
> Thanks for the bug report and your patience while I pursued a fix.
>
> -- Josh
>
> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
>
> > Hi Josh.
> >
> > Thanks for your reply. I'll tell you what I'm getting now from the runs below.
> > When I run without taking a checkpoint I get this output, and the processes don't finish:
> >
> > [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
> > Antes de MPI_Init
> > Antes de MPI_Init
> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > --------------------------------------------------------------------------
> > Error: The process below has failed. There is no checkpoint available for
> > this job, so we are terminating the application since automatic
> > recovery cannot occur.
> >
> > Internal Name: [[41167,1],0]
> > MCW Rank: 0
> >
> > --------------------------------------------------------------------------
> > [clus9:04985] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
> > [clus9:04985] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
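The checkpoint command used in the next quoted passage is not shown in the thread; presumably it is the standard ompi-checkpoint invocation from a second terminal, pointed at the PID of mpirun (which appears as [clus9:06105] in the log that follows), something like:

    # assumed checkpoint step: run from another terminal while the job is executing,
    # passing the PID of the mpirun process
    /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-checkpoint 6105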
> >
> > If I make a checkpoint of the mpirun process from another terminal during the execution, I get this output:
> >
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > --------------------------------------------------------------------------
> > Notice: The job has been successfully recovered from the
> > last checkpoint.
> > --------------------------------------------------------------------------
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
> > [clus9:06105] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovery_complete
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
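The output above repeatedly suggests turning off help-message aggregation. To see every individual occurrence of these messages rather than the "1 more process has sent help message" summaries, the MCA parameter can be passed on the command line, for example:

    # same run, but print each help/error message instead of aggregating them
    mpirun -np 2 -am ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 10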
> >
> > As you can see, it keeps looping on the recovery. Then, when I try to migrate these processes using ompi-migrate, I get this:
> >
> > [hmeyer@clus9 ~]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 18082
> > --------------------------------------------------------------------------
> > Error: The Job identified by PID (18082) was not able to migrate processes in this
> > job. This could be caused by any of the following:
> > - Invalid node or rank specified
> > - No processes on the indicated node can by migrated
> > - Process migration was not enabled for this job. Make sure to indicate
> > the proper AMCA file: "-am ft-enable-cr-recovery".
> > --------------------------------------------------------------------------
> >
> > But in the terminal where the application is running I get this:
> >
> > [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
> > Antes de MPI_Init
> > Antes de MPI_Init
> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > --------------------------------------------------------------------------
> > Warning: Could not find any processes to migrate on the nodes specified.
> > You provided the following:
> > Nodes: node9
> > Procs: (null)
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > Notice: The processes have been successfully migrated to/from the specified
> > machines.
> > --------------------------------------------------------------------------
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > --------------------------------------------------------------------------
> > Error: The process below has failed. There is no checkpoint available for
> > this job, so we are terminating the application since automatic
> > recovery cannot occur.
> >
> > Internal Name: [[62740,1],0]
> > MCW Rank: 0
> >
> > --------------------------------------------------------------------------
> > [clus9:18082] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
> > [clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
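For clarity, the two-terminal sequence being exercised above is the following; the argument to ompi-migrate is the PID of mpirun (18082 in this run):

    # terminal 1: start the job with recovery enabled
    /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10

    # terminal 2: ask the job identified by mpirun's PID to move its processes off node9 onto node3
    /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 18082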
> >
> > I assume that orte_get_job_data_object is the problem, because it is not obtaining the proper value.
> >
> > If you need more data, just let me know.
> >
> > Best Regards.
> >
> > Hugo Meyer
> >
> >
> > 2011/1/26 Joshua Hursey <jjhur...@open-mpi.org>
> > I found a few more bugs after testing the C/R functionality this morning. I just committed some more C/R fixes in r24306 (things are now working correctly on my test cluster).
> >   https://svn.open-mpi.org/trac/ompi/changeset/24306
> >
> > One thing I just noticed in your original email was that you are specifying the wrong parameter for migration (it is different than the standard C/R functionality for backwards-compatibility reasons). You need to use the 'ft-enable-cr-recovery' AMCA parameter:
> >   mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
> >
> > If you still get the segmentation fault after upgrading to the current trunk, can you send me a backtrace from the core file? That will help me narrow down on the problem.
> >
> > Thanks,
> > Josh
> >
> >
> > On Jan 26, 2011, at 8:40 AM, Hugo Meyer wrote:
> >
> > > Josh.
> > >
> > > The ompi-checkpoint and its restart are now working great, but the same error persists with ompi-migrate. I've also tried using "-r", but I get the same error.
> > >
> > > Best regards.
> > >
> > > Hugo Meyer
> > >
> > > 2011/1/26 Hugo Meyer <meyer.h...@gmail.com>
> > > Thanks Josh.
> > >
> > > I've already checked the prelink setting and it is set to "no".
> > >
> > > I'm going to try with the trunk head, and then I'll let you know how it goes.
> > >
> > > Best regards.
> > >
> > > Hugo Meyer
> > >
> > > 2011/1/25 Joshua Hursey <jjhur...@open-mpi.org>
> > >
> > > Can you try with the current trunk head (r24296)?
> > > I just committed a fix for the C/R functionality in which restarts were getting stuck. This will likely affect the migration functionality, but I have not had an opportunity to test it just yet.
> > >
> > > Another thing to check is that prelink is turned off on all of your machines.
> > >   https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
> > >
> > > Let me know if the problem persists, and I'll dig into it a bit more.
> > >
> > > Thanks,
> > > Josh
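One way to verify the prelink setting Josh refers to (following the BLCR FAQ linked above; the file path below assumes a Red Hat style system) is, on each node:

    # prelinking must be disabled for BLCR; this should report PRELINKING=no
    grep PRELINKING /etc/sysconfig/prelink

    # if prelinking had been enabled, undo it for existing binaries (run as root)
    /usr/sbin/prelink --undo --all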
> > > On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
> > >
> > > > Hello @ll
> > > >
> > > > I've got a problem when I try to use the ompi-migrate command.
> > > >
> > > > What I'm doing is executing, for example, the following application on one node of a cluster (both processes will run on the same node):
> > > >
> > > > mpirun -np 2 -am ft-enable-cr ./whoami 10 10
> > > >
> > > > Then, on the same node, I try to migrate the processes to another node:
> > > >
> > > > ompi-migrate -x node9 -t node3 14914
> > > >
> > > > And then I get this message:
> > > >
> > > > [clus9:15620] *** Process received signal ***
> > > > [clus9:15620] Signal: Segmentation fault (11)
> > > > [clus9:15620] Signal code: Address not mapped (1)
> > > > [clus9:15620] Failing at address: (nil)
> > > > [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b8d40]
> > > > [clus9:15620] *** End of error message ***
> > > > Segmentation fault
> > > >
> > > > I assume that maybe there is something wrong with the thread level, but I have configured Open MPI like this:
> > > >
> > > > ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/ --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/ --with-blcr-libdir=/soft/blcr-0.8.2/lib/
> > > >
> > > > Checkpoint and restart work fine: with a single process, ompi-checkpoint and ompi-restart work great. But when I restore an application that has more than one process, it is restored and runs up to the last line before MPI_FINALIZE(), and then the processes never finalize; I assume they never actually call MPI_FINALIZE().
> > > >
> > > > Best regards.
> > > >
> > > > Hugo Meyer
> > >
> > > ------------------------------------
> > > Joshua Hursey
> > > Postdoctoral Research Associate
> > > Oak Ridge National Laboratory
> > > http://users.nccs.gov/~jjhursey
> >
> > ------------------------------------
> > Joshua Hursey
> > Postdoctoral Research Associate
> > Oak Ridge National Laboratory
> > http://users.nccs.gov/~jjhursey
>
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel