On Tue, Aug 02, 2005 at 02:40:21PM -0500, Brian Barrett wrote:
> The tree now compiles with the --enable-mpi-threads problem. There
> is a bug in the event library that will cause deadlocks in orterun,
> so the tree isn't exactly useful right now. Tim Woodall is going to
> look into the problem.

ok - thanks!
A new problem arose after compiling and running my first test program.
It simply spawns a separate thread on each rank and sends/receives
1-byte (MPI_BYTE) messages in this thread. There seems to be a race
condition: sometimes all messages are received correctly, and sometimes
all messages fail and the receiving rank eats up a lot of memory
(>600MB) and segfaults. The backtrace is:

#0 0x0015d828 in ompi_convertor_unpack (pConv=0x83569e0, iov=0x479e798, out_size=0x479e7bc, max_data=0x479e7b8, freeAfter=0x479e7b4) at convertor.c:104
#1 0x00f76af4 in mca_ptl_tcp_recv_frag_progress (frag=0x8356980) at ptl_tcp_recvfrag.h:166
#2 0x00f76124 in mca_ptl_tcp_matched (ptl=0x83321a8, frag=0x8356980) at ptl_tcp.c:302
#3 0x0090d314 in mca_pml_teg_recv_frag_match (ptl=0x8320948, frag=0x8356980, header=0x8356ab4) at pml_teg_recvfrag.c:82
#4 0x00f7bbdc in mca_ptl_tcp_recv_frag_handler (frag=0x8356a94, sd=12) at ptl_tcp_recvfrag.c:107
#5 0x00f7a20f in mca_ptl_tcp_peer_recv_handler (sd=12, flags=2, user=0x836b628) at ptl_tcp_peer.c:606
#6 0x002a8ff8 in opal_event_process_active () at event.c:453
#7 0x002a92e3 in opal_event_loop (flags=2) at event.c:543
#8 0x002b733b in opal_progress () at opal_progress.c:211
#9 0x00909295 in opal_condition_wait (c=0x23bc80, m=0x23bce0) at condition.h:66
#10 0x00908a93 in mca_pml_teg_recv (addr=0x479ea94, count=1, datatype=0x804a4a8, src=-1, tag=100002, comm=0x804a5f0, status=0x8380108) at pml_teg_irecv.c:100
#11 0x001bc50f in PMPI_Recv (buf=0x479ea94, count=1, type=0x804a4a8, source=-1, tag=100002, comm=0x804a5f0, status=0x8380108) at precv.c:66
#12 0x08048f66 in MPI_Barrier_start_worker_thread (param=0x83809f0) at nbbarr.c:76
#13 0x0072fdec in pthread_create@@GLIBC_2.1 () from /lib/tls/libpthread.so.0
#14 0x0082519a in iswctype_l () from /lib/tls/libc.so.6

The zipped corefile can be found at:
http://gustav.informatik.tu-chemnitz.de/~htor/sec/core.23839.gz

Any idea, or should I try to debug it?

Thanks,
  Torsten

-- 
bash$ :(){ :|:&};:
----- pgp: http://www.unixer.de/htor-key.asc -----
An optimist believes we live in the best of all possible worlds.
A pessimist is sure of it!