Hi George,

On Thursday 19 January 2006 17:22, George Bosilca wrote:
> I was hoping my patch solved the problem completely ... looks like
> it's not the case :( How exactly do you get the deadlock in the
> mpi_test_suite? Which configure options? Only --enable-progress-threads?

This happens with both --enable-progress-threads alone and with an
additional --enable-mpi-threads.
Both hang in the same places:

Process 0:
#4  0x40315a56 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#5  0x40222513 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#6  0x4007d7a2 in opal_condition_wait (c=0x4013c6c0, m=0x4013c720) at condition.h:64
#7  0x4007d40b in ompi_request_wait_all (count=1, requests=0x80bc1c0, statuses=0x0) at req_wait.c:159
#8  0x4073083f in ompi_coll_tuned_bcast_intra_basic_linear (buff=0x80c9c90, count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at coll_tuned_bcast.c:762
#9  0x4072b002 in ompi_coll_tuned_bcast_intra_dec_fixed (buff=0x80c9c90, count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at coll_tuned_decision_fixed.c:175
#10 0x40083dae in PMPI_Bcast (buffer=0x80c9c90, count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at pbcast.c:88
#11 0x0804f2cf in tst_coll_bcast_run (env=0xbfffeac0) at tst_coll_bcast.c:74
#12 0x0804bf21 in tst_test_run_func (env=0xbfffeac0) at tst_tests.c:377
#13 0x0804a46a in main (argc=7, argv=0xbfffeb74) at mpi_test_suite.c:319

Process 1:
#4  0x40315a56 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#5  0x40222513 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#6  0x406941e3 in opal_condition_wait (c=0x4013c6c0, m=0x4013c720) at condition.h:64
#7  0x406939f2 in mca_pml_ob1_recv (addr=0x80c9c58, count=1000, datatype=0x8061de8, src=0, tag=-17, comm=0x80627e0, status=0x0) at pml_ob1_irecv.c:96
#8  0x407307a4 in ompi_coll_tuned_bcast_intra_basic_linear (buff=0x80c9c58, count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at coll_tuned_bcast.c:729
#9  0x4072b002 in ompi_coll_tuned_bcast_intra_dec_fixed (buff=0x80c9c58, count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at coll_tuned_decision_fixed.c:175
#10 0x40083dae in PMPI_Bcast (buffer=0x80c9c58, count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at pbcast.c:88
#11 0x0804f2cf in tst_coll_bcast_run (env=0xbfffeac0) at tst_coll_bcast.c:74
#12 0x0804bf21 in tst_test_run_func (env=0xbfffeac0) at tst_tests.c:377
#13 0x0804a46a in main (argc=7, argv=0xbfffeb74) at mpi_test_suite.c:319

And yes, when I run with the basic coll, we also hang ;-]

mpirun -np 2 --mca coll basic ./mpi_test_suite -r FULL -c MPI_COMM_WORLD -d MPI_INT

#4  0x40315a56 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libpthread.so.0
#5  0x40222513 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#6  0x406941e3 in opal_condition_wait (c=0x4013c6c0, m=0x4013c720) at condition.h:64
#7  0x406939f2 in mca_pml_ob1_recv (addr=0x80c4ca0, count=1000, datatype=0x8061de8, src=0, tag=-17, comm=0x80627e0, status=0x0) at pml_ob1_irecv.c:96
#8  0x4070e402 in mca_coll_basic_bcast_lin_intra (buff=0x80c4ca0, count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at coll_basic_bcast.c:57
#9  0x40083dae in PMPI_Bcast (buffer=0x80c4ca0, count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at pbcast.c:88
#10 0x0804f2cf in tst_coll_bcast_run (env=0xbfffeab0) at tst_coll_bcast.c:74
#11 0x0804bf21 in tst_test_run_func (env=0xbfffeab0) at tst_tests.c:377
#12 0x0804a46a in main (argc=7, argv=0xbfffeb64) at mpi_test_suite.c:319

Now, for what it's worth, I ran with Helgrind to check for possible race conditions, and it spews out:

==20240== Possible data race writing variable at 0x1D84F46C
==20240==    at 0x1DA8BE61: mca_oob_tcp_recv (oob_tcp_recv.c:129)
==20240==    by 0x1D73A636: mca_oob_recv_packed (oob_base_recv.c:69)
==20240==    by 0x1D73B2B0: mca_oob_xcast (oob_base_xcast.c:133)
==20240==    by 0x1D511138: ompi_mpi_init (ompi_mpi_init.c:421)
==20240==  Address 0x1D84F46C is 1020 bytes inside a block of size 3168 alloc'd by thread 1
==20240==    at 0x1D4A80B4: malloc (in /usr/lib/valgrind/vgpreload_helgrind.so)
==20240==    by 0x1D7DF7BE: opal_free_list_grow (opal_free_list.c:94)
==20240==    by 0x1D7DF754: opal_free_list_init (opal_free_list.c:79)
==20240==    by 0x1DA815E3: mca_oob_tcp_component_init (oob_tcp.c:530)

So this was my starting point for checking whether we may have races in opal/mpi_free_list...
CU,
Rainer
-- 
---------------------------------------------------------------------
Dipl.-Inf. Rainer Keller          email: kel...@hlrs.de
High Performance Computing        Tel:   ++49 (0)711-685 5858
Center Stuttgart (HLRS)           Fax:   ++49 (0)711-685 5832
POSTAL: Nobelstrasse 19           http://www.hlrs.de/people/keller
ACTUAL: Allmandring 30, R. O.030  AIM:   rusraink
70550 Stuttgart