On Tue, Aug 02, 2005 at 02:40:21PM -0500, Brian Barrett wrote:
> The tree now compiles with the --enable-mpi-threads problem.  There  
> is a bug in the event library that will cause deadlocks in orterun,  
> so the tree isn't exactly useful right now.  Tim Woodall is going to  
> look into the problem.
ok - thanks!

A new problem arose after compiling and running my first test program.
It simply spawns a separate thread on each rank and sends/receives 1-byte
(MPI_BYTE) messages in this thread. There seems to be a race condition:
sometimes all messages are received correctly, and sometimes all
messages fail, the receiving rank eats up a lot of memory (>600MB),
and then segfaults.

The backtrace is:
#0  0x0015d828 in ompi_convertor_unpack (pConv=0x83569e0, iov=0x479e798, 
    out_size=0x479e7bc, max_data=0x479e7b8, freeAfter=0x479e7b4)
    at convertor.c:104
#1  0x00f76af4 in mca_ptl_tcp_recv_frag_progress (frag=0x8356980)
    at ptl_tcp_recvfrag.h:166
#2  0x00f76124 in mca_ptl_tcp_matched (ptl=0x83321a8, frag=0x8356980)
    at ptl_tcp.c:302
#3  0x0090d314 in mca_pml_teg_recv_frag_match (ptl=0x8320948, frag=0x8356980, 
    header=0x8356ab4) at pml_teg_recvfrag.c:82
#4  0x00f7bbdc in mca_ptl_tcp_recv_frag_handler (frag=0x8356a94, sd=12)
    at ptl_tcp_recvfrag.c:107
#5  0x00f7a20f in mca_ptl_tcp_peer_recv_handler (sd=12, flags=2, 
    user=0x836b628) at ptl_tcp_peer.c:606
#6  0x002a8ff8 in opal_event_process_active () at event.c:453
#7  0x002a92e3 in opal_event_loop (flags=2) at event.c:543
#8  0x002b733b in opal_progress () at opal_progress.c:211
#9  0x00909295 in opal_condition_wait (c=0x23bc80, m=0x23bce0)
    at condition.h:66
#10 0x00908a93 in mca_pml_teg_recv (addr=0x479ea94, count=1, 
    datatype=0x804a4a8, src=-1, tag=100002, comm=0x804a5f0, status=0x8380108)
    at pml_teg_irecv.c:100
#11 0x001bc50f in PMPI_Recv (buf=0x479ea94, count=1, type=0x804a4a8, 
    source=-1, tag=100002, comm=0x804a5f0, status=0x8380108) at precv.c:66
#12 0x08048f66 in MPI_Barrier_start_worker_thread (param=0x83809f0)
    at nbbarr.c:76
#13 0x0072fdec in pthread_create@@GLIBC_2.1 () from /lib/tls/libpthread.so.0
#14 0x0082519a in iswctype_l () from /lib/tls/libc.so.6

The zipped corefile can be found at:
http://gustav.informatik.tu-chemnitz.de/~htor/sec/core.23839.gz

Any ideas, or should I try to debug it myself?

Thanks,
   Torsten

-- 
 bash$ :(){ :|:&};: ----- pgp: http://www.unixer.de/htor-key.asc -----
An optimist believes we live in the best of all possible worlds.  
A pessimist is sure of it!
