Right. Sorry I misspoke.

On Thursday, June 23, 2011 at 3:32 PM, Ralph Castain wrote:

> Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem 
> of "not giving up the thread". The problem was that Josh's test never called 
> progress. It would have been equally okay to simply call 
> "opal_event_dispatch" while waiting for the callback.
> 
> All applications have to cycle the progress engine.
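
To make the point above concrete, here is a minimal, self-contained sketch of why a wait loop that never calls progress never sees its callback. Nothing in it is ORTE code: `post_event`, `progress_once`, `mark_done`, and the one-slot `mailbox` are hypothetical stand-ins for "a message has arrived" and "cycle the progress engine once". It only assumes what is said above, namely that there are no threads, so a callback can run only from inside a progress call.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type; not an ORTE or OPAL type. */
typedef void (*event_cb)(void *arg);

/* One-slot mailbox standing in for a message that has arrived
 * but has not yet been delivered to the application. */
static struct {
    event_cb cb;
    void *arg;
    bool queued;
} mailbox;

/* Queue an event. The callback does NOT run here. */
void post_event(event_cb cb, void *arg)
{
    mailbox.cb = cb;
    mailbox.arg = arg;
    mailbox.queued = true;
}

/* Stand-in for one cycle of a progress engine: deliver the queued
 * event, if any. Returns the number of events handled (0 or 1). */
int progress_once(void)
{
    if (!mailbox.queued) {
        return 0;
    }
    mailbox.queued = false;
    mailbox.cb(mailbox.arg);
    return 1;
}

/* Example callback: sets the flag the waiter is watching. */
void mark_done(void *arg)
{
    *(bool *)arg = true;
}

/* Plain while loop: only tests the flag. With no other thread to
 * run the callback, the flag never changes. Returns -1 at the limit. */
int spin_without_progress(const bool *done, int limit)
{
    assert(limit > 0);
    int spins = 0;
    while (!*done && spins < limit) {
        spins++;
    }
    return *done ? spins : -1;
}

/* Same loop, but each iteration cycles the progress stand-in,
 * which is where the callback actually fires. */
int wait_with_progress(const bool *done, int limit)
{
    assert(limit > 0);
    int spins = 0;
    while (!*done && spins < limit) {
        progress_once();
        spins++;
    }
    return *done ? spins : -1;
}
```

With an event posted via `post_event(mark_done, &done)`, `spin_without_progress(&done, 1000)` runs to its limit and returns -1, while `wait_with_progress(&done, 1000)` returns after the first iteration. The limit exists only so the sketch terminates; in the real test the equivalent of the first loop is the hang.
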
> 
> 
> On Jun 23, 2011, at 1:18 PM, Wesley Bland wrote:
> > Josh,
> > 
> > There were a couple of bugs that I cleared up in my most recent checkin, 
> > but I also needed to modify your test. The callback for the application 
> > layer errmgr actually occurs in the application layer. Your test was never 
> > giving up the thread to the ORTE application event loop to receive its 
> > message from the ORTED. I changed your while loop to an 
> > ORTE_PROGRESSED_WAIT and that fixed the problem.
> > 
> > Try running the attached code with the modifications and see if that clears 
> > up the problem. It did for me.
> > 
> > Thanks,
> > Wesley
> > 
> > On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
> > 
> > > So I finally got a chance to test the branch this morning. I cannot
> > > get it to work. Maybe I'm doing something wrong, or missing some MCA
> > > parameter?
> > > 
> > > -------------------------
> > > [jjhursey@smoky-login1 resilient-orte] hg summary
> > > parent: 2:c550cf6ed6a2 tip
> > >  Newest version. Synced with trunk r24785.
> > > branch: default
> > > commit: 1 modified, 8097 unknown
> > > update: (current)
> > > -------------------------
> > > (the 1 modified was the test program attached)
> > > 
> > > Attached is a modified version of the orte_abort.c program found in
> > > ${top}/orte/test/system. This program is ORTE only, and registers the
> > > errmgr callback to trigger correct termination. You will need to
> > > configure Open MPI with '--with-devel-headers' to build this. But then
> > > you can compile with:
> > >  ortecc -g orte_abort.c -o orte_abort
> > > 
> > > These are the configure options that I used:
> > >  --with-devel-headers --enable-binaries --disable-io-romio
> > > --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
> > > F77=gfortran FC=gfortran
> > > 
> > > 
> > > If the HNP has no processes on it - I get a hang:
> > > -------------------------------
> > > mpirun -np 4 --nolocal orte_abort
> > > orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
> > > orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
> > > orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
> > > orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
> > > orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
> > > mpirun: killing job...
> > > 
> > > [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file errmgr_hnp.c at line 824
> > > [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file orted/orted_comm.c at line 1341
> > > mpirun: abort is already in progress...hit ctrl-c again to forcibly 
> > > terminate
> > > 
> > > [jjhursey@smoky14 system] echo $?
> > > 1
> > > -------------------------------
> > > 
> > > If the HNP has processes on it, but not the one that aborted - I get a 
> > > hang:
> > > -------------------------------
> > > [jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
> > > orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
> > > orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
> > > orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
> > > orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
> > > orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
> > > mpirun: killing job...
> > > 
> > > [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
> > > readv failed: Connection reset by peer (104)
> > > [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
> > > readv failed: Connection reset by peer (104)
> > > [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file errmgr_hnp.c at line 824
> > > [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
> > > past end of buffer in file orted/orted_comm.c at line 1341
> > > mpirun: abort is already in progress...hit ctrl-c again to forcibly 
> > > terminate
> > > 
> > > [jjhursey@smoky14 system] echo $?
> > > 1
> > > --------------------------------
> > > 
> > > If the HNP has processes on it, and it is the one that aborted - I get
> > > immediate return, but no callback:
> > > --------------------------------
> > > [jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
> > > orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
> > > orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
> > > orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
> > > orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
> > > orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
> > > [jjhursey@smoky14 system] echo $?
> > > 3
> > > --------------------------------
> > > 
> > > Any ideas on what I might be doing wrong?
> > > 
> > > I tried with both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
> > > NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior.
> > > 
> > > -- Josh
> > > 
> > > 
> > > 
> > > On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
> > > > Last reminder (I hope). RFC goes in at COB today.
> > > > Wesley
> > > > _______________________________________________
> > > > devel mailing list
> > > > de...@open-mpi.org (mailto:de...@open-mpi.org)
> > > > http://www.open-mpi.org/mailman/listinfo.cgi/devel
> > > 
> > > 
> > > 
> > > -- 
> > > Joshua Hursey
> > > Postdoctoral Research Associate
> > > Oak Ridge National Laboratory
> > > http://users.nccs.gov/~jjhursey
> > > 
> > > Attachments: 
> > > - orte_abort.c
> > > 
> > > 
> > 
> > 
> > <orte_abort.c>
