Ummm...just to clarify. There are no threads in ORTE, so it wasn't a problem of 
"not giving up the thread". The problem was that Josh's test never called 
progress. It would have been equally okay to simply call "opal_event_dispatch" 
while waiting for the callback.

All applications have to cycle the progress engine.
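
To make the point concrete, here is a minimal, self-contained sketch in plain C. It is not ORTE or OPAL code: the one-slot event queue and every name in it (post_event, progress_once, errmgr_callback, wait_with_progress, wait_without_progress) are hypothetical stand-ins for illustration. It only shows that a queued callback cannot fire until the waiting code cycles progress.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for an event library: a single pending-event slot. */
typedef void (*event_cb_t)(void *arg);

struct pending_event {
    event_cb_t cb;
    void *arg;
    bool ready;
};

static struct pending_event queue;

/* Simulates a message arriving from the daemon: queued, not yet delivered. */
void post_event(event_cb_t cb, void *arg)
{
    assert(cb != NULL);
    queue.cb = cb;
    queue.arg = arg;
    queue.ready = true;
}

/* Hypothetical progress call: delivers at most one queued event.
 * Returns the number of callbacks that ran (0 or 1). */
int progress_once(void)
{
    if (!queue.ready) {
        return 0;
    }
    queue.ready = false;
    queue.cb(queue.arg);
    return 1;
}

/* Application-layer callback: records that the notification arrived. */
void errmgr_callback(void *arg)
{
    *(bool *)arg = true;
}

/* Wait loop that never calls progress: the callback can never run. */
bool wait_without_progress(const bool *flag, int max_spins)
{
    for (int i = 0; i < max_spins && !*flag; i++) {
        /* spin only; nothing drives the event queue */
    }
    return *flag;
}

/* Wait loop that cycles progress on every iteration. */
bool wait_with_progress(const bool *flag, int max_spins)
{
    for (int i = 0; i < max_spins && !*flag; i++) {
        progress_once();
    }
    return *flag;
}

/* Runs both wait loops against the same queued event and prints the result. */
void run_demo(void)
{
    bool called_back = false;

    post_event(errmgr_callback, &called_back);
    printf("without progress: callback ran = %d\n",
           wait_without_progress(&called_back, 1000));
    printf("with progress:    callback ran = %d\n",
           wait_with_progress(&called_back, 1000));
}
```

Calling run_demo() from a main() exercises both loops. The spin limit exists only so the sketch terminates; a real wait loop without progress would simply hang, which matches what the test showed.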


On Jun 23, 2011, at 1:18 PM, Wesley Bland wrote:

> Josh,
> 
> There were a couple of bugs that I cleared up in my most recent checkin, but 
> I also needed to modify your test. The callback for the application layer 
> errmgr actually occurs in the application layer. Your test was never giving 
> up the thread to the ORTE application event loop to receive its message from 
> the ORTED. I changed your while loop to an ORTE_PROGRESSED_WAIT and that 
> fixed the problem.
> 
> Try running the attached code with the modifications and see if that clears 
> up the problem. It did for me.
> 
> Thanks,
> Wesley
> On Thursday, June 23, 2011 at 10:16 AM, Josh Hursey wrote:
> 
>> So I finally got a chance to test the branch this morning. I cannot
>> get it to work. Maybe I'm doing something wrong, or missing some MCA
>> parameter?
>> 
>> -------------------------
>> [jjhursey@smoky-login1 resilient-orte] hg summary
>> parent: 2:c550cf6ed6a2 tip
>> Newest version. Synced with trunk r24785.
>> branch: default
>> commit: 1 modified, 8097 unknown
>> update: (current)
>> -------------------------
>> (the 1 modified was the test program attached)
>> 
>> Attached is a modified version of the orte_abort.c program found in
>> ${top}/orte/test/system. This program is ORTE only, and registers the
>> errmgr callback to trigger correct termination. You will need to
>> configure Open MPI with '--with-devel-headers' to build this. But then
>> you can compile with:
>> ortecc -g orte_abort.c -o orte_abort
>> 
>> These are the configure options that I used:
>> --with-devel-headers --enable-binaries --disable-io-romio
>> --enable-contrib-no-build=vt --enable-debug CC=gcc CXX=g++
>> F77=gfortran FC=gfortran
>> 
>> 
>> If the HNP has no processes on it - I get a hang:
>> -------------------------------
>> mpirun -np 4 --nolocal orte_abort
>> orte_abort: Name [[60121,1],0,0] Host: smoky13 Pid 3688 -- Initalized
>> orte_abort: Name [[60121,1],1,0] Host: smoky13 Pid 3689 -- Initalized
>> orte_abort: Name [[60121,1],2,0] Host: smoky13 Pid 3690 -- Initalized
>> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Initalized
>> orte_abort: Name [[60121,1],3,0] Host: smoky13 Pid 3691 -- Calling Abort
>> mpirun: killing job...
>> 
>> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
>> past end of buffer in file errmgr_hnp.c at line 824
>> [smoky14:04002] [[60121,0],0,0] ORTE_ERROR_LOG: Data unpack would read
>> past end of buffer in file orted/orted_comm.c at line 1341
>> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>> 
>> [jjhursey@smoky14 system] echo $?
>> 1
>> -------------------------------
>> 
>> If the HNP has processes on it, but not the one that aborted - I get a hang:
>> -------------------------------
>> [jjhursey@smoky14 system] mpirun -np 4 --npernode 2 orte_abort
>> orte_abort: Name [[60302,1],0,0] Host: smoky14 Pid 3830 -- Initalized
>> orte_abort: Name [[60302,1],1,0] Host: smoky14 Pid 3831 -- Initalized
>> orte_abort: Name [[60302,1],2,0] Host: smoky13 Pid 3484 -- Initalized
>> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Initalized
>> orte_abort: Name [[60302,1],3,0] Host: smoky13 Pid 3485 -- Calling Abort
>> mpirun: killing job...
>> 
>> [smoky14:03829] [[60302,0],0,0]-[[60302,1],1,0] mca_oob_tcp_msg_recv:
>> readv failed: Connection reset by peer (104)
>> [smoky14:03829] [[60302,0],0,0]-[[60302,1],0,0] mca_oob_tcp_msg_recv:
>> readv failed: Connection reset by peer (104)
>> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
>> past end of buffer in file errmgr_hnp.c at line 824
>> [smoky14:03829] [[60302,0],0,0] ORTE_ERROR_LOG: Data unpack would read
>> past end of buffer in file orted/orted_comm.c at line 1341
>> mpirun: abort is already in progress...hit ctrl-c again to forcibly terminate
>> 
>> [jjhursey@smoky14 system] echo $?
>> 1
>> --------------------------------
>> 
>> If the HNP has processes on it, and it is the one that aborted - I get
>> immediate return, but no callback:
>> --------------------------------
>> [jjhursey@smoky14 system] mpirun -np 4 --npernode 4 orte_abort
>> orte_abort: Name [[60292,1],0,0] Host: smoky14 Pid 3840 -- Initalized
>> orte_abort: Name [[60292,1],1,0] Host: smoky14 Pid 3841 -- Initalized
>> orte_abort: Name [[60292,1],2,0] Host: smoky14 Pid 3842 -- Initalized
>> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Initalized
>> orte_abort: Name [[60292,1],3,0] Host: smoky14 Pid 3843 -- Calling Abort
>> [jjhursey@smoky14 system] echo $?
>> 3
>> --------------------------------
>> 
>> Any ideas on what I might be doing wrong?
>> 
>> I tried with both calling 'orte_errmgr.abort(ORTE_PROC_MY_NAME->vpid,
>> NULL);' and 'kill(getpid(), SIGKILL);' and got the same behavior.
>> 
>> -- Josh
>> 
>> 
>> 
>> On Thu, Jun 23, 2011 at 9:58 AM, Wesley Bland <wbl...@eecs.utk.edu> wrote:
>>> Last reminder (I hope). RFC goes in at COB today.
>>> Wesley
>>> _______________________________________________
>>> devel mailing list
>>> de...@open-mpi.org
>>> http://www.open-mpi.org/mailman/listinfo.cgi/devel
>> 
>> 
>> 
>> -- 
>> Joshua Hursey
>> Postdoctoral Research Associate
>> Oak Ridge National Laboratory
>> http://users.nccs.gov/~jjhursey
>> 
>> Attachments:
>> - orte_abort.c
> 
> <orte_abort.c>
