So I was not able to reproduce this issue.

A couple notes:
 - You can see the node-to-process-rank mapping using the '-display-map' 
command line option to mpirun. This will show you the node names that Open MPI 
is using, and how it intends to lay out the processes. You can use the 
'-display-allocation' option to see all of the nodes that Open MPI knows 
about; Open MPI currently cannot migrate processes to a node that it did not 
know about at startup. (See the example command after these notes.)
 - If the problem persists, add the following MCA parameters to your 
~/.openmpi/mca-params.conf file and send me a zipped-up text file of the 
output. It might show us where things are going wrong:
----------------
orte_debug_daemons=1
errmgr_base_verbose=20
snapc_full_verbose=20
----------------
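
For example, something along these lines (treat this as a sketch; I am reusing 
the test command from your earlier mails):
----------------
mpirun -np 2 -display-allocation -display-map -am ft-enable-cr-recovery \
    ./whoami 10 10
----------------
The node names shown in the map should be the ones to pass to ompi-migrate's 
'-x' and '-t' options. (The MCA parameters above can also be passed on the 
mpirun command line with '--mca <name> <value>' if you prefer not to edit the 
file.)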

-- Josh

On Jan 31, 2011, at 9:46 AM, Joshua Hursey wrote:

> 
> On Jan 31, 2011, at 6:47 AM, Hugo Meyer wrote:
> 
>> Hi Joshua.
>> 
>> I've tried the migration again, and this is what I get (the processes are 
>> running on the same node as mpirun):
>> 
>> Terminal 1:
>> 
>> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun 
>> -np 2 -am ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 
>> 10
>> Antes de MPI_Init
>> Antes de MPI_Init
>> --------------------------------------------------------------------------
>> Warning: Could not find any processes to migrate on the nodes specified.
>>         You provided the following:
>> Nodes: node9
>> Procs: (null)
>> --------------------------------------------------------------------------
>> Soy el número 1 (100000000)
>> Terminando, una instrucción antes del finalize
>> Soy el número 0 (100000000)
>> Terminando, una instrucción antes del finalize
>> 
>> Terminal 2:
>> 
>> [hmeyer@clus9 build]$ 
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t 
>> node3 11724
>> --------------------------------------------------------------------------
>> Error: The Job identified by PID (11724) was not able to migrate processes 
>> in this
>>       job. This could be caused by any of the following:
>>       - Invalid node or rank specified
>>       - No processes on the indicated node can by migrated
>>       - Process migration was not enabled for this job. Make sure to indicate
>>         the proper AMCA file: "-am ft-enable-cr-recovery".
>> --------------------------------------------------------------------------
> 
> The error message indicates that there were no processes found on 'node9'. 
> Did you confirm that there were processes running on that node?
> 
> It is possible that the node name that Open MPI is using is different from 
> what you entered. For example, it could be fully qualified (e.g., 
> node9.my.domain.com), so you might try that too. MPI_Get_processor_name() 
> should return the name of the node that we are attempting to use, so you 
> could have all processes print that out when they start up.
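> 
> For example, a minimal sketch along these lines (illustrative only, not 
> taken from your whoami code) would show the exact names:
> 
>   #include <stdio.h>
>   #include <mpi.h>
> 
>   int main(int argc, char **argv)
>   {
>       char name[MPI_MAX_PROCESSOR_NAME];
>       int rank, len;
> 
>       MPI_Init(&argc, &argv);
>       MPI_Comm_rank(MPI_COMM_WORLD, &rank);
>       /* Node name exactly as the MPI layer sees it */
>       MPI_Get_processor_name(name, &len);
>       printf("Rank %d is running on node %s\n", rank, name);
>       MPI_Finalize();
>       return 0;
>   }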
> 
> 
>> Then I tried another way, and I get the following:
>> 
>> Terminal 1:
>> 
>> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun 
>> -np 3 -am ft-enable-cr-recovery ./whoami 10 10
>> Antes de MPI_Init
>> Antes de MPI_Init
>> Antes de MPI_Init
>> --------------------------------------------------------------------------
>> Notice: A migration of this job has been requested.
>>        The processes below will be migrated.
>>        Please standby.
>>      [[40382,1],1] Rank 1 on Node clus9
>> 
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> Error: The process below has failed. There is no checkpoint available for
>>       this job, so we are terminating the application since automatic
>>       recovery cannot occur.
>> Internal Name: [[40382,1],1]
>> MCW Rank: 1
>> 
>> --------------------------------------------------------------------------
>> Soy el número 0 (100000000)
>> Terminando, una instrucción antes del finalize
>> Soy el número 2 (100000000)
>> Terminando, una instrucción antes del finalize
>> 
>> Terminal 2:
>> 
>> [hmeyer@clus9 build]$ 
>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3 
>> 11784
>> [clus9:11795] *** Process received signal ***
>> [clus9:11795] Signal: Segmentation fault (11)
>> [clus9:11795] Signal code: Address not mapped (1)
>> [clus9:11795] Failing at address: (nil)
>> [clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b9d40]
>> [clus9:11795] *** End of error message ***
>> Segmentation fault
> 
> Hmm, that's not good. It looks like the automatic recovery is jumping in 
> while migrating, which should not be happening. I'll take a look and see if 
> I can reproduce it locally.
> 
> Thanks,
> Josh
> 
>> 
>> Am I using the ompi-migrate command the right way, or am I missing 
>> something? The first attempt didn't find any processes.
>> 
>> Best Regards.
>> 
>> Hugo Meyer
>> 
>> 
>> 2011/1/28 Hugo Meyer <meyer.h...@gmail.com>
>> Thanks to you Joshua.
>> 
>> I will try the procedure with these modifications and I will let you know 
>> how it goes.
>> 
>> Best Regards.
>> 
>> Hugo Meyer
>> 
>> 2011/1/27 Joshua Hursey <jjhur...@open-mpi.org>
>> 
>> I believe that this is now fixed on the trunk. All the details are in the 
>> commit message:
>> https://svn.open-mpi.org/trac/ompi/changeset/24317
>> 
>> In my testing yesterday, I did not test the scenario where the node with 
>> mpirun also hosts application processes (the test cluster I was using does 
>> not run this way by default), so I was able to reproduce the problem by 
>> running on a single node. A couple of bugs emerged that are fixed in the 
>> commit. The two bugs that were hurting you were the TCP socket cleanup 
>> (which caused the looping of the automatic recovery) and the incorrect 
>> accounting of local process termination (which caused the modex errors).
>> 
>> Let me know if that fixes the problems that you were seeing.
>> 
>> Thanks for the bug report and your patience while I pursued a fix.
>> 
>> -- Josh
>> 
>> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
>> 
>>> Hi Josh.
>>> 
>>> Thanks for your reply. I'll describe below what I'm getting now from the 
>>> runs.
>>> When I run without taking a checkpoint I get this output, and the 
>>> processes don't finish:
>>> 
>>> [hmeyer@clus9 whoami]$ 
>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am 
>>> ft-enable-cr-recovery ./whoami 10 10
>>> Antes de MPI_Init
>>> Antes de MPI_Init
>>> [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> Soy el número 1 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> Soy el número 0 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> --------------------------------------------------------------------------
>>> Error: The process below has failed. There is no checkpoint available for
>>>       this job, so we are terminating the application since automatic
>>>       recovery cannot occur.
>>> Internal Name: [[41167,1],0]
>>> MCW Rank: 0
>>> 
>>> --------------------------------------------------------------------------
>>> [clus9:04985] 1 more process has sent help message help-orte-errmgr-hnp.txt 
>>> / autor_failed_to_recover_proc
>>> [clus9:04985] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
>>> help / error messages
>>> 
>>> If I checkpoint the mpirun process from another terminal while it is 
>>> running, I get this output:
>>> 
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
>>> line 350
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
>>> at line 323
>>> [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
>>> line 350
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
>>> at line 323
>>> [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> --------------------------------------------------------------------------
>>> Notice: The job has been successfully recovered from the
>>>        last checkpoint.
>>> --------------------------------------------------------------------------
>>> Soy el número 1 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> Soy el número 0 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt 
>>> / autor_recovering_job
>>> [clus9:06105] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
>>> help / error messages
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
>>> line 350
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
>>> at line 323
>>> [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
>>> line 350
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
>>> at line 323
>>> [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt 
>>> / autor_recovery_complete
>>> Soy el número 0 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> Soy el número 1 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt 
>>> / autor_recovering_job
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
>>> line 350
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
>>> at line 323
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at 
>>> line 350
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end 
>>> of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c 
>>> at line 323
>>> [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> 
>>> As you can see, it keeps looping on the recovery. Then, when I try to 
>>> migrate these processes using ompi-migrate, I get this:
>>> 
>>> [hmeyer@clus9 ~]$ 
>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t 
>>> node3 18082
>>> --------------------------------------------------------------------------
>>> Error: The Job identified by PID (18082) was not able to migrate processes 
>>> in this
>>>       job. This could be caused by any of the following:
>>>       - Invalid node or rank specified
>>>       - No processes on the indicated node can by migrated
>>>       - Process migration was not enabled for this job. Make sure to 
>>> indicate
>>>         the proper AMCA file: "-am ft-enable-cr-recovery".
>>> --------------------------------------------------------------------------
>>> But in the terminal where the application is running I get this:
>>> 
>>> [hmeyer@clus9 whoami]$ 
>>> /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am 
>>> ft-enable-cr-recovery ./whoami 10 10
>>> Antes de MPI_Init
>>> Antes de MPI_Init
>>> [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file 
>>> ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> --------------------------------------------------------------------------
>>> Warning: Could not find any processes to migrate on the nodes specified.
>>>         You provided the following:
>>> Nodes: node9
>>> Procs: (null)
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> Notice: The processes have been successfully migrated to/from the specified
>>>        machines.
>>> --------------------------------------------------------------------------
>>> Soy el número 1 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> Soy el número 0 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> --------------------------------------------------------------------------
>>> Error: The process below has failed. There is no checkpoint available for
>>>       this job, so we are terminating the application since automatic
>>>       recovery cannot occur.
>>> Internal Name: [[62740,1],0]
>>> MCW Rank: 0
>>> 
>>> --------------------------------------------------------------------------
>>> [clus9:18082] 1 more process has sent help message help-orte-errmgr-hnp.txt 
>>> / autor_failed_to_recover_proc
>>> [clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see all 
>>> help / error messages
>>> 
>>> I assume that orte_get_job_data_object() is the problem, because it is not 
>>> returning the proper value.
>>> 
>>> If you need more data, just let me know.
>>> 
>>> Best Regards.
>>> 
>>> Hugo Meyer
>>> 
>>> 
>>> 
>>> 
>>> 2011/1/26 Joshua Hursey <jjhur...@open-mpi.org>
>>> I found a few more bugs after testing the C/R functionality this morning. I 
>>> just committed some more C/R fixes in r24306 (things are now working 
>>> correctly on my test cluster).
>>> https://svn.open-mpi.org/trac/ompi/changeset/24306
>>> 
>>> One thing I just noticed in your original email is that you are specifying 
>>> the wrong parameter for migration (it is different from the parameter for 
>>> the standard C/R functionality, for backwards compatibility reasons). You 
>>> need to use the 'ft-enable-cr-recovery' AMCA parameter:
>>> mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
>>> 
>>> If you still get the segmentation fault after upgrading to the current 
>>> trunk, can you send me a backtrace from the core file? That will help me 
>>> narrow down the problem.
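>>> 
>>> Something like the following should produce a usable backtrace (a sketch; 
>>> I am assuming ompi-migrate is the binary that dumped core, and the core 
>>> file name depends on your system's settings):
>>> 
>>>   ulimit -c unlimited   # before re-running, so a core file is written
>>>   gdb /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate <core-file>
>>>   (gdb) bt full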
>>> 
>>> Thanks,
>>> Josh
>>> 
>>> 
>>> On Jan 26, 2011, at 8:40 AM, Hugo Meyer wrote:
>>> 
>>>> Josh.
>>>> 
>>>> ompi-checkpoint and its restart are now working great, but the same error 
>>>> persists with ompi-migrate. I've also tried using "-r", but I get the same 
>>>> error.
>>>> 
>>>> Best regards.
>>>> 
>>>> Hugo Meyer
>>>> 
>>>> 2011/1/26 Hugo Meyer <meyer.h...@gmail.com>
>>>> Thanks Josh.
>>>> 
>>>> I've already checked prelink and it is set to "no".
>>>> 
>>>> I'm going to try with the trunk head, and then I'll let you know how it 
>>>> goes.
>>>> 
>>>> Best regards.
>>>> 
>>>> Hugo Meyer
>>>> 
>>>> 2011/1/25 Joshua Hursey <jjhur...@open-mpi.org>
>>>> 
>>>> Can you try with the current trunk head (r24296)?
>>>> I just committed a fix for the C/R functionality where restarts were 
>>>> getting stuck. This will likely affect the migration functionality as 
>>>> well, but I have not had an opportunity to test it just yet.
>>>> 
>>>> Another thing to check is that prelink is turned off on all of your 
>>>> machines.
>>>> https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
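>>>> 
>>>> On RHEL-style systems this is typically something like the following (a 
>>>> sketch; the exact file and commands vary by distribution, see the FAQ 
>>>> above):
>>>> 
>>>>   # In /etc/sysconfig/prelink set:
>>>>   PRELINKING=no
>>>>   # Then undo any prelinking already applied to installed libraries:
>>>>   /usr/sbin/prelink -ua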
>>>> 
>>>> Let me know if the problem persists, and I'll dig into a bit more.
>>>> 
>>>> Thanks,
>>>> Josh
>>>> 
>>>> On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
>>>> 
>>>>> Hello @ll
>>>>> 
>>>>> I've got a problem when I try to use the ompi-migrate command.
>>>>> 
>>>>> What I'm doing is executing, for example, the following application on 
>>>>> one node of a cluster (both processes will run on the same node):
>>>>> 
>>>>> mpirun -np 2 -am ft-enable-cr ./whoami 10 10
>>>>> 
>>>>> Then, from the same node, I try to migrate the processes to another node:
>>>>> 
>>>>> ompi-migrate -x node9 -t node3 14914
>>>>> 
>>>>> And then I get this message:
>>>>> 
>>>>> [clus9:15620] *** Process received signal ***
>>>>> [clus9:15620] Signal: Segmentation fault (11)
>>>>> [clus9:15620] Signal code: Address not mapped (1)
>>>>> [clus9:15620] Failing at address: (nil)
>>>>> [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b8d40]
>>>>> [clus9:15620] *** End of error message ***
>>>>> Segmentation fault
>>>>> 
>>>>> I assume that maybe there is something wrong with the thread level, but I 
>>>>> have configured Open MPI like this:
>>>>> 
>>>>> ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/ 
>>>>> --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr 
>>>>> --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread 
>>>>> --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/ 
>>>>> --with-blcr-libdir=/soft/blcr-0.8.2/lib/
>>>>> 
>>>>> Checkpoint and restart work fine, but when I restore an application that 
>>>>> has more than one process, it is restored and runs up to the last line 
>>>>> before MPI_Finalize(); the processes never finish, so I assume they never 
>>>>> complete MPI_Finalize(). With a single process, ompi-checkpoint and 
>>>>> ompi-restart work great.
>>>>> 
>>>>> Best regards.
>>>>> 
>>>>> Hugo Meyer
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>> 
>>> 
>>> 
>>> 
>> 
>> 
>> 
>> 
>> 
> 
> 
> 
> 

------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey

