I was not able to reproduce this issue. A couple of notes:

 - You can see the node-to-process-rank mapping using the '-display-map' command line option to mpirun. This will show you the node names that Open MPI is using and how it intends to lay out the processes. You can use the '-display-allocation' option to see all of the nodes that Open MPI knows about. Open MPI cannot, currently, migrate to a node that it does not know about at startup.

 - If the problem persists, add the following MCA parameters to your ~/.openmpi/mca-params.conf file and send me a zipped-up text file of the output. It might show us where things are going wrong:
----------------
orte_debug_daemons=1
errmgr_base_verbose=20
snapc_full_verbose=20
----------------
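As a quick sanity check on the node names (along the lines of the MPI_Get_processor_name() suggestion below), here is a minimal, untested sketch you could run next to your test program; the file name and the output format are just illustrative:

    /* node_check.c -- minimal sketch: each rank reports the node name that
     * Open MPI sees, so it can be compared against the -x/-t arguments given
     * to ompi-migrate and against the output of 'mpirun -display-map'. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, len;
        char name[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Get_processor_name(name, &len);

        /* If this prints a fully qualified name (e.g., node9.my.domain.com)
         * rather than 'node9', pass the fully qualified name to ompi-migrate. */
        printf("Rank %d is running on node '%s'\n", rank, name);

        MPI_Finalize();
        return 0;
    }

Comparing its output against what '-display-map' reports, and against the names you pass to 'ompi-migrate -x/-t', should tell us whether a short versus fully qualified hostname is the problem.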
-- Josh

On Jan 31, 2011, at 9:46 AM, Joshua Hursey wrote:

> On Jan 31, 2011, at 6:47 AM, Hugo Meyer wrote:
>
>> Hi Joshua.
>>
>> I've tried the migration again, and I get the following (running the processes where mpirun is running):
>>
>> Terminal 1:
>>
>> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery --mca orte_base_help_aggregate 0 ./whoami 10 10
>> Antes de MPI_Init
>> Antes de MPI_Init
>> --------------------------------------------------------------------------
>> Warning: Could not find any processes to migrate on the nodes specified.
>>          You provided the following:
>>          Nodes: node9
>>          Procs: (null)
>> --------------------------------------------------------------------------
>> Soy el número 1 (100000000)
>> Terminando, una instrucción antes del finalize
>> Soy el número 0 (100000000)
>> Terminando, una instrucción antes del finalize
>>
>> Terminal 2:
>>
>> [hmeyer@clus9 build]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 11724
>> --------------------------------------------------------------------------
>> Error: The Job identified by PID (11724) was not able to migrate processes in this
>>        job. This could be caused by any of the following:
>>        - Invalid node or rank specified
>>        - No processes on the indicated node can by migrated
>>        - Process migration was not enabled for this job. Make sure to indicate
>>          the proper AMCA file: "-am ft-enable-cr-recovery".
>> --------------------------------------------------------------------------
>
> The error message indicates that there were no processes found on 'node9'. Did you confirm that there were processes running on that node?
>
> It is possible that the node name that Open MPI is using is different than what you put in. For example, it could be fully qualified (e.g., node9.my.domain.com), so you might try that too. MPI_Get_processor_name() should return the name of the node that we are attempting to use, so you could have all processes print that out when they start up.
>
>> Then I tried another way, and I get the following:
>>
>> Terminal 1:
>>
>> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 3 -am ft-enable-cr-recovery ./whoami 10 10
>> Antes de MPI_Init
>> Antes de MPI_Init
>> Antes de MPI_Init
>> --------------------------------------------------------------------------
>> Notice: A migration of this job has been requested.
>>         The processes below will be migrated.
>>         Please standby.
>>         [[40382,1],1] Rank 1 on Node clus9
>>
>> --------------------------------------------------------------------------
>> --------------------------------------------------------------------------
>> Error: The process below has failed. There is no checkpoint available for
>>        this job, so we are terminating the application since automatic
>>        recovery cannot occur.
>>        Internal Name: [[40382,1],1]
>>        MCW Rank: 1
>>
>> --------------------------------------------------------------------------
>> Soy el número 0 (100000000)
>> Terminando, una instrucción antes del finalize
>> Soy el número 2 (100000000)
>> Terminando, una instrucción antes del finalize
>>
>> Terminal 2:
>>
>> [hmeyer@clus9 build]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -r 1 -t node3 11784
>> [clus9:11795] *** Process received signal ***
>> [clus9:11795] Signal: Segmentation fault (11)
>> [clus9:11795] Signal code: Address not mapped (1)
>> [clus9:11795] Failing at address: (nil)
>> [clus9:11795] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b9d40]
>> [clus9:11795] *** End of error message ***
>> Segmentation fault
>
> Humm. Well, that's not good. It looks like the automatic recovery is jumping in while migrating, which should not be happening. I'll take a look and see if I can reproduce it locally.
>
> Thanks,
> Josh
>
>> Am I using the ompi-migrate command in the right way, or am I missing something? The first attempt didn't find any process.
>>
>> Best Regards.
>>
>> Hugo Meyer
>>
>> 2011/1/28 Hugo Meyer <meyer.h...@gmail.com>
>> Thanks to you, Joshua.
>>
>> I will try the procedure with these modifications and I will let you know how it goes.
>>
>> Best Regards.
>>
>> Hugo Meyer
>>
>> 2011/1/27 Joshua Hursey <jjhur...@open-mpi.org>
>>
>> I believe that this is now fixed on the trunk. All the details are in the commit message:
>> https://svn.open-mpi.org/trac/ompi/changeset/24317
>>
>> In my testing yesterday, I did not test the scenario where the node with mpirun also contains processes (the test cluster I was using does not by default run this way), so I was able to reproduce by running on a single node. There were a couple of bugs that emerged that are fixed in the commit. The two bugs that were hurting you were the TCP socket cleanup (which caused the looping of the automatic recovery) and the incorrect accounting of local process termination (which caused the modex errors).
>>
>> Let me know if that fixes the problems that you were seeing.
>>
>> Thanks for the bug report and your patience while I pursued a fix.
>>
>> -- Josh
>>
>> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
>>
>>> Hi Josh.
>>>
>>> Thanks for your reply. I'll tell you what I'm getting now from the executions.
>>> When I run without doing a checkpoint I get this output, and the processes don't finish:
>>>
>>> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
>>> Antes de MPI_Init
>>> Antes de MPI_Init
>>> [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> Soy el número 1 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> Soy el número 0 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> --------------------------------------------------------------------------
>>> Error: The process below has failed. There is no checkpoint available for
>>>        this job, so we are terminating the application since automatic
>>>        recovery cannot occur.
>>>        Internal Name: [[41167,1],0]
>>>        MCW Rank: 0
>>>
>>> --------------------------------------------------------------------------
>>> [clus9:04985] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
>>> [clus9:04985] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>
>>> If I make a checkpoint of the mpirun process from another terminal during the execution, I get this output:
>>>
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>>> [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>>> [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> --------------------------------------------------------------------------
>>> Notice: The job has been successfully recovered from the
>>>         last checkpoint.
>>> --------------------------------------------------------------------------
>>> Soy el número 1 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> Soy el número 0 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
>>> [clus9:06105] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>>> [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>>> [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovery_complete
>>> Soy el número 0 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> Soy el número 1 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>>> [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
>>> [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
>>> [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>> [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
>>>
>>> As you can see, it keeps looping on the recovery. Then, when I try to migrate these processes using ompi-migrate, I get this:
>>>
>>> [hmeyer@clus9 ~]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 18082
>>> --------------------------------------------------------------------------
>>> Error: The Job identified by PID (18082) was not able to migrate processes in this
>>>        job. This could be caused by any of the following:
>>>        - Invalid node or rank specified
>>>        - No processes on the indicated node can by migrated
>>>        - Process migration was not enabled for this job. Make sure to indicate
>>>          the proper AMCA file: "-am ft-enable-cr-recovery".
>>> --------------------------------------------------------------------------
>>>
>>> But in the terminal where the application is running I get this:
>>>
>>> [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
>>> Antes de MPI_Init
>>> Antes de MPI_Init
>>> [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
>>> --------------------------------------------------------------------------
>>> Warning: Could not find any processes to migrate on the nodes specified.
>>>          You provided the following:
>>>          Nodes: node9
>>>          Procs: (null)
>>> --------------------------------------------------------------------------
>>> --------------------------------------------------------------------------
>>> Notice: The processes have been successfully migrated to/from the specified
>>>         machines.
>>> --------------------------------------------------------------------------
>>> Soy el número 1 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> Soy el número 0 (100000000)
>>> Terminando, una instrucción antes del finalize
>>> --------------------------------------------------------------------------
>>> Error: The process below has failed. There is no checkpoint available for
>>>        this job, so we are terminating the application since automatic
>>>        recovery cannot occur.
>>>        Internal Name: [[62740,1],0]
>>>        MCW Rank: 0
>>>
>>> --------------------------------------------------------------------------
>>> [clus9:18082] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
>>> [clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
>>>
>>> I assume that orte_get_job_data_object is the problem, because it is not obtaining the proper value.
>>>
>>> If you need more data, just let me know.
>>>
>>> Best Regards.
>>>
>>> Hugo Meyer
>>>
>>> 2011/1/26 Joshua Hursey <jjhur...@open-mpi.org>
>>> I found a few more bugs after testing the C/R functionality this morning. I just committed some more C/R fixes in r24306 (things are now working correctly on my test cluster).
>>> https://svn.open-mpi.org/trac/ompi/changeset/24306
>>>
>>> One thing I just noticed in your original email was that you are specifying the wrong parameter for migration (it is different than the standard C/R functionality for backwards compatibility reasons). You need to use the 'ft-enable-cr-recovery' AMCA parameter:
>>> mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
>>>
>>> If you still get the segmentation fault after upgrading to the current trunk, can you send me a backtrace from the core file? That will help me narrow down on the problem.
>>>
>>> Thanks,
>>> Josh
>>>
>>> On Jan 26, 2011, at 8:40 AM, Hugo Meyer wrote:
>>>
>>>> Josh.
>>>>
>>>> ompi-checkpoint and its restart are now working great, but the same error persists with ompi-migrate. I've also tried using "-r", but I get the same error.
>>>>
>>>> Best regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>> 2011/1/26 Hugo Meyer <meyer.h...@gmail.com>
>>>> Thanks Josh.
>>>>
>>>> I've already checked the prelink and it is set to "no".
>>>>
>>>> I'm going to try with the trunk head, and then I'll let you know how it goes.
>>>>
>>>> Best regards.
>>>>
>>>> Hugo Meyer
>>>>
>>>> 2011/1/25 Joshua Hursey <jjhur...@open-mpi.org>
>>>>
>>>> Can you try with the current trunk head (r24296)?
>>>> I just committed a fix for the C/R functionality in which restarts were getting stuck. This will likely affect the migration functionality, but I have not had an opportunity to test just yet.
>>>>
>>>> Another thing to check is that prelink is turned off on all of your machines.
>>>> https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
>>>>
>>>> Let me know if the problem persists, and I'll dig into it a bit more.
>>>>
>>>> Thanks,
>>>> Josh
>>>>
>>>> On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
>>>>
>>>>> Hello @ll
>>>>>
>>>>> I've got a problem when I try to use the ompi-migrate command.
>>>>>
>>>>> What I'm doing is executing, for example, the following application on one node of a cluster (both processes will run on the same node):
>>>>>
>>>>> mpirun -np 2 -am ft-enable-cr ./whoami 10 10
>>>>>
>>>>> Then, on the same node, I try to migrate the processes to another node:
>>>>>
>>>>> ompi-migrate -x node9 -t node3 14914
>>>>>
>>>>> And then I get this message:
>>>>>
>>>>> [clus9:15620] *** Process received signal ***
>>>>> [clus9:15620] Signal: Segmentation fault (11)
>>>>> [clus9:15620] Signal code: Address not mapped (1)
>>>>> [clus9:15620] Failing at address: (nil)
>>>>> [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b8d40]
>>>>> [clus9:15620] *** End of error message ***
>>>>> Segmentation fault
>>>>>
>>>>> I assume that maybe there is something wrong with the thread level, but I have configured Open MPI like this:
>>>>>
>>>>> ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/ --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/ --with-blcr-libdir=/soft/blcr-0.8.2/lib/
>>>>>
>>>>> Checkpoint and restart work fine, but when I restore an application that has more than one process, it is restored and executed until the last line before MPI_FINALIZE(); the processes never finalize, so I assume that they never call MPI_FINALIZE(). With one process, ompi-checkpoint and ompi-restart work great.
>>>>>
>>>>> Best regards.
>>>>>
>>>>> Hugo Meyer
------------------------------------
Joshua Hursey
Postdoctoral Research Associate
Oak Ridge National Laboratory
http://users.nccs.gov/~jjhursey