Thank you, Joshua.

I will try the procedure with these modifications and I will let you know how
it goes.

Best Regards.

Hugo Meyer

2011/1/27 Joshua Hursey <jjhur...@open-mpi.org>

> I believe that this is now fixed on the trunk. All the details are in the
> commit message:
>  https://svn.open-mpi.org/trac/ompi/changeset/24317
>
> In my testing yesterday, I did not test the scenario where the node with
> mpirun also contains processes (the test cluster I was using does not by
> default run this way). So I was able to reproduce by running on a single
> node. There were a couple bugs that emerged that are fixed in the commit.
> The two bugs that were hurting you were the TCP socket cleanup (which caused
> the looping of the automatic recovery), and the incorrect accounting of
> local process termination (which caused the modex errors).
>
> Let me know if that fixes the problems that you were seeing.
>
> Thanks for the bug report and your patience while I pursued a fix.
>
> -- Josh
>
> On Jan 27, 2011, at 11:28 AM, Hugo Meyer wrote:
>
> > Hi Josh.
> >
> > Thanks for your reply. Below is what I am getting now from the executions.
> > When I run without taking a checkpoint I get this output, and the process does not finish:
> >
> > [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
> > Antes de MPI_Init
> > Antes de MPI_Init
> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:04985] [[41167,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > --------------------------------------------------------------------------
> > Error: The process below has failed. There is no checkpoint available for
> >        this job, so we are terminating the application since automatic
> >        recovery cannot occur.
> > Internal Name: [[41167,1],0]
> > MCW Rank: 0
> >
> > --------------------------------------------------------------------------
> > [clus9:04985] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
> > [clus9:04985] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> >
> > If I take a checkpoint of the mpirun process from another terminal during the execution, I get this output:
> >
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > --------------------------------------------------------------------------
> > Notice: The job has been successfully recovered from the
> >         last checkpoint.
> > --------------------------------------------------------------------------
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
> > [clus9:06105] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovery_complete
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > [clus9:06105] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_recovering_job
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06105] [[42095,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06107] [[42095,1],1] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../orte/mca/grpcomm/base/grpcomm_base_modex.c at line 350
> > [clus9:06106] [[42095,1],0] ORTE_ERROR_LOG: Data unpack would read past end of buffer in file ../../../../../orte/mca/grpcomm/bad/grpcomm_bad_module.c at line 323
> > [clus9:06106] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> > [clus9:06107] pml:ob1: ft_event(Restart): Failed orte_grpcomm.modex() = -26
> >
> > As you can see, it keeps looping on the recovery. Then, when I try to migrate these processes using ompi-migrate, I get this:
> >
> > [hmeyer@clus9 ~]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/ompi-migrate -x node9 -t node3 18082
> > --------------------------------------------------------------------------
> > Error: The Job identified by PID (18082) was not able to migrate processes in this
> >        job. This could be caused by any of the following:
> >        - Invalid node or rank specified
> >        - No processes on the indicated node can by migrated
> >        - Process migration was not enabled for this job. Make sure to indicate
> >          the proper AMCA file: "-am ft-enable-cr-recovery".
> > --------------------------------------------------------------------------
> > But in the terminal where the application is running, I get this:
> >
> > [hmeyer@clus9 whoami]$ /home/hmeyer/desarrollo/ompi-code/binarios/bin/mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
> > Antes de MPI_Init
> > Antes de MPI_Init
> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > [clus9:18082] [[62740,0],0] ORTE_ERROR_LOG: Error in file ../../../../../orte/mca/errmgr/hnp/errmgr_hnp_crmig.c at line 287
> > --------------------------------------------------------------------------
> > Warning: Could not find any processes to migrate on the nodes specified.
> >          You provided the following:
> > Nodes: node9
> > Procs: (null)
> > --------------------------------------------------------------------------
> > --------------------------------------------------------------------------
> > Notice: The processes have been successfully migrated to/from the specified
> >         machines.
> > --------------------------------------------------------------------------
> > Soy el número 1 (100000000)
> > Terminando, una instrucción antes del finalize
> > Soy el número 0 (100000000)
> > Terminando, una instrucción antes del finalize
> > --------------------------------------------------------------------------
> > Error: The process below has failed. There is no checkpoint available for
> >        this job, so we are terminating the application since automatic
> >        recovery cannot occur.
> > Internal Name: [[62740,1],0]
> > MCW Rank: 0
> >
> > --------------------------------------------------------------------------
> > [clus9:18082] 1 more process has sent help message help-orte-errmgr-hnp.txt / autor_failed_to_recover_proc
> > [clus9:18082] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
> >
> > I assume that orte_get_job_data_object is the problem, because it is not obtaining the proper value.
> >
> > If you need more data, just let me know.
> >
> > Best Regards.
> >
> > Hugo Meyer
> >
> >
> >
> >
> > 2011/1/26 Joshua Hursey <jjhur...@open-mpi.org>
> > I found a few more bugs after testing the C/R functionality this morning. I just committed some more C/R fixes in r24306 (things are now working correctly on my test cluster).
> >  https://svn.open-mpi.org/trac/ompi/changeset/24306
> >
> > One thing I just noticed in your original email was that you are specifying the wrong parameter for migration (it is different than the standard C/R functionality for backwards compatibility reasons). You need to use the 'ft-enable-cr-recovery' AMCA parameter:
> >  mpirun -np 2 -am ft-enable-cr-recovery ./whoami 10 10
> >
> > If you still get the segmentation fault after upgrading to the current trunk, can you send me a backtrace from the core file? That will help me narrow down the problem.
> >
> > Thanks,
> > Josh
> >
> >
> > On Jan 26, 2011, at 8:40 AM, Hugo Meyer wrote:
> >
> > > Josh.
> > >
> > > ompi-checkpoint and ompi-restart are now working great, but the same error persists with ompi-migrate. I've also tried using "-r", but I get the same error.
> > >
> > > Best regards.
> > >
> > > Hugo Meyer
> > >
> > > 2011/1/26 Hugo Meyer <meyer.h...@gmail.com>
> > > Thanks Josh.
> > >
> > > I've already checked prelink and it is set to "no".
> > >
> > > I'm going to try with the trunk head, and then I'll let you know how it goes.
> > >
> > > Best regards.
> > >
> > > Hugo Meyer
> > >
> > > 2011/1/25 Joshua Hursey <jjhur...@open-mpi.org>
> > >
> > > Can you try with the current trunk head (r24296)?
> > > I just committed a fix for the C/R functionality in which restarts were getting stuck. This will likely affect the migration functionality, but I have not had an opportunity to test just yet.
> > >
> > > Another thing to check is that prelink is turned off on all of your machines.
> > >  https://upc-bugs.lbl.gov//blcr/doc/html/FAQ.html#prelink
> > >
> > > Let me know if the problem persists, and I'll dig into it a bit more.
> > >
> > > Thanks,
> > > Josh
> > >
> > > On Jan 24, 2011, at 11:37 AM, Hugo Meyer wrote:
> > >
> > > > Hello all,
> > > >
> > > > I've got a problem when I try to use the ompi-migrate command.
> > > >
> > > > What I'm doing is running, for example, the following application on one node of a cluster (both processes will run on the same node):
> > > >
> > > > mpirun -np 2 -am ft-enable-cr ./whoami 10 10
> > > >
> > > > Then, on the same node, I try to migrate the processes to another node:
> > > >
> > > > ompi-migrate -x node9 -t node3 14914
> > > >
> > > > And then I get this message:
> > > >
> > > > [clus9:15620] *** Process received signal ***
> > > > [clus9:15620] Signal: Segmentation fault (11)
> > > > [clus9:15620] Signal code: Address not mapped (1)
> > > > [clus9:15620] Failing at address: (nil)
> > > > [clus9:15620] [ 0] /lib64/libpthread.so.0 [0x2aaaac0b8d40]
> > > > [clus9:15620] *** End of error message ***
> > > > Segmentation fault
> > > >
> > > > I assume that maybe there is something wrong with the thread level, but I have configured Open MPI like this:
> > > >
> > > > ../configure --prefix=/home/hmeyer/desarrollo/ompi-code/binarios/ --enable-debug --enable-debug-symbols --enable-trace --with-ft=cr --disable-ipv6 --enable-opal-multi-threads --enable-ft-thread --without-hwloc --disable-vt --with-blcr=/soft/blcr-0.8.2/ --with-blcr-libdir=/soft/blcr-0.8.2/lib/
> > > >
> > > > With a single process, ompi-checkpoint and ompi-restart work great. But when I restore an application that has more than one process, it is restored and executes until the last line before MPI_FINALIZE(), and then the processes never finish. I assume that they never call MPI_FINALIZE().
> > > >
> > > > Best regards.
> > > >
> > > > Hugo Meyer
> > > > _______________________________________________
> > > > devel mailing list
> > > > de...@open-mpi.org
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >
> > > ------------------------------------
> > > Joshua Hursey
> > > Postdoctoral Research Associate
> > > Oak Ridge National Laboratory
> > > http://users.nccs.gov/~jjhursey
> > >
> > >
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > >
> > >
> > > _______________________________________________
> > > devel mailing list
> > > de...@open-mpi.org
> > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> > ------------------------------------
> > Joshua Hursey
> > Postdoctoral Research Associate
> > Oak Ridge National Laboratory
> > http://users.nccs.gov/~jjhursey
> >
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> >
> > _______________________________________________
> > devel mailing list
> > de...@open-mpi.org
> > http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
> ------------------------------------
> Joshua Hursey
> Postdoctoral Research Associate
> Oak Ridge National Laboratory
> http://users.nccs.gov/~jjhursey
>
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>
