On Jan 31, 2011, at 6:47 AM, Hugo Meyer wrote:

> Hi Joshua.
> 
> I've tried the migration again, and I get the following (migrating processes 
> running on the node where mpirun is running):
> 
> Terminal 1:
> 
> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun 
> -np 2 -am ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 
> 10
> Antes de MPI_Init
> Antes de MPI_Init
> --------------------------------------------------------------------------
> Warning: Could not find any processes to migrate on the nodes specified.
>          You provided the following:
> Nodes: node9
> Procs: (null)
> --------------------------------------------------------------------------
> Soy el número 1 (100000000)
> Terminando, una instrucción antes del finalize
> Soy el número 0 (100000000)
> Terminando, una instrucción antes del finalize
> 
> Terminal 2:
> 
> [hmeyer@clus9 build]$ 
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 
> 11724
> --------------------------------------------------------------------------
> Error: The Job identified by PID (11724) was not able to migrate processes in 
> this
>        job. This could be caused by any of the following:
>        - Invalid node or rank specified
>        - No processes on the indicated node can by migrated
>        - Process migration was not enabled for this job. Make sure to indicate
>          the proper AMCA file: "-am ft-enable-cr-recovery".
> --------------------------------------------------------------------------

The error message indicates that there were no processes found on 'node9'. Did 
you confirm that there were processes running on that node?

It is possible that the node name that Open MPI is using is different from the 
one you passed in. For example, it could be fully qualified (e.g., 
node9.my.domain.com), so you might try that too. MPI_Get_processor_name() 
should return the name of the node that each process is actually running on, 
so you could have all processes print that out at startup.
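
For example, something along these lines (just a minimal sketch, not your 
actual whoami code) would print the node name each rank sees, which you can 
then compare against the names you pass to ompi-migrate:

  /* Minimal sketch: print the node name each MPI rank is running on,
   * so it can be compared against the names given to ompi-migrate. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv)
  {
      char node[MPI_MAX_PROCESSOR_NAME];
      int rank, len;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Get_processor_name(node, &len);
      printf("Rank %d is running on node '%s'\n", rank, node);

      /* ... rest of the application ... */

      MPI_Finalize();
      return 0;
  }

If that prints a fully qualified name, try passing that same string to the 
-x/-t options of ompi-migrate.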


> Then I tried another way, and I get the following:
> 
> Terminal 1:
> 
> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun 
> -np 3 -am ft-enable-cr-recovery ./whoami 10 10
> Antes de MPI_Init
> Antes de MPI_Init
> Antes de MPI_Init
> --------------------------------------------------------------------------
> Notice: A migration of this job has been requested.
>         The processes below will be migrated.
>         Please standby.
>       [[40382,1],1] Rank 1 on Node clus9
> 
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> Error: The process below has failed. There is no checkpoint available for
>        this job, so we are terminating the application since automatic
>        recovery cannot occur.
> Internal Name: [[40382,1],1]
> MCW Rank: 1
> 
> --------------------------------------------------------------------------
> Soy el número 0 (100000000)
> Terminando, una instrucción antes del finalize
> Soy el número 2 (100000000)
> Terminando, una instrucción antes del finalize
> 
> Terminal 2:
> 
> [hmeyer@clus9 build]$ 
> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3 
> 11784
> [clus9:11795] *** Process received signal ***
> [clus9:11795] Signal: Segmentation fault (11)
> [clus9:11795] Signal code: Address not mapped (1)
> [clus9:11795] Failing at address: (nil)
> [clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b9d40]
> [clus9:11795] *** End of error message ***
> Segmentation fault

Hmm. Well, that's not good. It looks like the automatic recovery is jumping in 
while migrating, which should not be happening. I'll take a look and see if I 
can reproduce it locally.

Thanks,
Josh

> 
> Am I using the ompi-migrate command in the right way, or am I missing 
> something? The first attempt didn't find any processes.
> 
> Best Regards.
> 
> Hugo Meyer
> 
> 
> 2011/1/28 Hugo Meyer <meyer.h...@gmail.com>
> Thank you, Joshua.
> 
> I will try the procedure with these modifications and I will let you know how 
> it goes.
> 
> Best Regards.
> 
> Hugo Meyer
> 
> 2011/1/27 Joshua Hursey <jjhur...@open-mpi.org>
> 
> I believe that this is now fixed on the trunk. All the details are in the 
> commit message:
>  https://svn.open-mpi.org/trac/ompi/changeset/24317
> 
> In my testing yesterday, I did not test the scenario where the node with 
> mpirun also contains processes (the test cluster I was using does not run 
> this way by default), so I was able to reproduce the problem by running on a 
> single node. A couple of bugs emerged that are fixed in the commit. The two 
> bugs that were hurting you were the TCP socket cleanup (which caused the 
> looping of the automatic recovery) and the incorrect accounting of local 
> process termination (which caused the modex errors).
> 
> Let me know if that fixes the problems that you were seeing.
> 
> Thanks for the bug report and your patience while I pursued a fix.
> 
> -- Josh
> 
> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
> 
> > Hi Josh.
> >
> > Thanks for your reply. I'll describe below what I'm getting now from the 
> > executions.
> > When I run without taking a checkpoint I get this output, and the processes 
> > don't finish:
> >
> > [hmeyer@clus9 whoami]$ 
> > /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am 
> > ft-enable-cr-recovery ./whoami 10 10
> > Antes de MPI_Init
> > Antes de MPI_Init
> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > --------------------------------------------------------------------------
> > Error: The process below has failed. There is no checkpoint available for
> >        this job, so we are terminating the application since automatic
> >        recovery cannot occur.
> > Internal Name: [[41167,1],0]
> > MCW Rank: 0
> >
> > --------------------------------------------------------------------------
> > [clus9:04985] 1 more process has sent help message help-orte-errmgr-hnp.txt 
> > / autor_failed_to_recover_proc
> > [clus9:04985] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> > help / error messages
> >
> > If I take a checkpoint of the mpirun process from another terminal during 
> > the execution, I get this output:
> >
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
> > line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
> > at line 323
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
> > line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
> > at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > --------------------------------------------------------------------------
> > Notice: The job has been successfully recovered from the
> >         last checkpoint.
> > --------------------------------------------------------------------------
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt 
> > / autor_recovering_job
> > [clus9:06105] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> > help / error messages
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
> > line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
> > at line 323
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
> > line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
> > at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt 
> > / autor_recovery_complete
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt 
> > / autor_recovering_job
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
> > line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
> > at line 323
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
> > line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
> > of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
> > at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> >
> > As you can see, it keeps looping on the recovery. Then, when I try to 
> > migrate these processes using ompi-migrate, I get this:
> >
> > [hmeyer@clus9 ~]$ 
> > /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t 
> > node3 18082
> > --------------------------------------------------------------------------
> > Error: The Job identified by PID (18082) was not able to migrate processes 
> > in this
> >        job. This could be caused by any of the following:
> >        - Invalid node or rank specified
> >        - No processes on the indicated node can by migrated
> >        - Process migration was not enabled for this job. Make sure to 
> > indicate
> >          the proper AMCA file: "-am ft-enable-cr-recovery".
> > --------------------------------------------------------------------------
> > But in the terminal where the application is running I get this:
> >
> > [hmeyer@clus9 whoami]$ 
> > /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am 
> > ft-enable-cr-recovery ./whoami 10 10
> > Antes de MPI_Init
> > Antes de MPI_Init
> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file 
> > ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > --------------------------------------------------------------------------
> > Warning: Could not find any processes to migrate on the nodes specified.
> >          You provided the following:
> > Nodes: node9
> > Procs: (null)
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > Notice: The processes have been successfully migrated to/from the specified
> >         machines.
> > --------------------------------------------------------------------------
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > --------------------------------------------------------------------------
> > Error: The process below has failed. There is no checkpoint available for
> >        this job, so we are terminating the application since automatic
> >        recovery cannot occur.
> > Internal Name: [[62740,1],0]
> > MCW Rank: 0
> >
> > --------------------------------------------------------------------------
> > [clus9:18082] 1 more process has sent help message help-orte-errmgr-hnp.txt 
> > / autor_failed_to_recover_proc
> > [clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
> > help / error messages
> >
> > I assume that orte_get_job_data_object is the problem, because it is not 
> > obtaining the proper value.
> >
> > If you need more data, just let me know.
> >
> > Best Regards.
> >
> > Hugo Meyer
> >
> >
> >
> >
> > 2011/1/26 Joshua Hursey <jjhur...@open-mpi.org>
> > I found a few more bugs after testing the C/R functionality this morning. I 
> > just committed some more C/R fixes in r24306 (things are now working 
> > correctly on my test cluster).
> >  https://svn.open-mpi.org/trac/ompi/changeset/24306
> >
> > One thing I just noticed in your original email is that you are specifying 
> > the wrong parameter for migration (it is different from the standard C/R 
> > parameter for backwards-compatibility reasons). You need to use the 
> > 'ft-enable-cr-recovery' AMCA parameter:
> >  mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
> >
> > If you still get the segmentation fault after upgrading to the current 
> > trunk, can you send me a backtrace from the core file? That will help me 
> > narrow down the problem.
> >
> > Thanks,
> > Josh
> >
> >
> > On Jan 26, 2011, at 8:40 AM, Hugo Meyer wrote:
> >
> > > Josh.
> > >
> > > The ompi-checkpoint with its restart is now working great, but the same 
> > > error persists with ompi-migrate. I've also tried using "-r", but I get 
> > > the same error.
> > >
> > > Best regards.
> > >
> > > Hugo Meyer
> > >
> > > 2011/1/26 Hugo Meyer <meyer.h...@gmail.com>
> > > Thanks Josh.
> > >
> > > I've already checked the prelink setting and it is set to "no".
> > >
> > > I'm going to try with the trunk head, and then I'll let you know how it 
> > > goes.
> > >
> > > Best regards.
> > >
> > > Hugo Meyer
> > >
> > > 2011/1/25 Joshua Hursey <jjhur...@open-mpi.org>
> > >
> > > Can you try with the current trunk head (r24296)?
> > > I just committed a fix for the C/R functionality in which restarts were 
> > > getting stuck. This will likely affect the migration functionality, but I 
> > > have not had an opportunity to test just yet.
> > >
> > > Another thing to check is that prelink is turned off on all of your 
> > > machines.
> > >  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
> > >
> > > Let me know if the problem persists, and I'll dig into it a bit more.
> > >
> > > Thanks,
> > > Josh
> > >
> > > On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
> > >
> > > > Hello @ll
> > > >
> > > > I've got a problem when I try to use the ompi-migrate command.
> > > >
> > > > What I'm doing is running, for example, the following application on one 
> > > > node of a cluster (both processes will run on the same node):
> > > >
> > > > mpirun -np 2 -am ft-enable-cr ./whoami 10 10
> > > >
> > > > Then, on the same node, I try to migrate the processes to another node:
> > > >
> > > > ompi-migrate -x node9 -t node3 14914
> > > >
> > > > And then I get this message:
> > > >
> > > > [clus9:15620] *** Process received signal ***
> > > > [clus9:15620] Signal: Segmentation fault (11)
> > > > [clus9:15620] Signal code: Address not mapped (1)
> > > > [clus9:15620] Failing at address: (nil)
> > > > [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b8d40]
> > > > [clus9:15620] *** End of error message ***
> > > > Segmentation fault
> > > >
> > > > I assume that maybe there is something wrong with the thread level, but 
> > > > I have configured Open MPI like this:
> > > >
> > > > ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/ 
> > > > --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr 
> > > > --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread 
> > > > --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/ 
> > > > --with-blcr-libdir=/soft/blcr-0.8.2/lib/
> > > >
> > > > Checkpoint and restart work fine, but when I restore an 
> > > > application that has more than one process, it is restored and 
> > > > executed up to the last line before MPI_FINALIZE(), but the processes 
> > > > never finalize; I assume they never call MPI_FINALIZE(). With 
> > > > one process, ompi-checkpoint and ompi-restart work great.
> > > >
> > > > Best regards.
> > > >
> > > > Hugo Meyer
> > >
> > > ------------------------------------
> > > Joshua Hursey
> > > Postdoctoral Research Associate
> > > Oak Ridge National Laboratory
> > > http://users.nccs.gov/~jjhursey
> > >
> > >
> > >
> > >
> >
> > ------------------------------------
> > Joshua Hursey
> > Postdoctoral Research Associate
> > Oak Ridge National Laboratory
> > http://users.nccs.gov/~jjhursey
> >
> >
> >
> 
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
> 
> 
> 
> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey

