Hi George,
On Thursday 19 January 2006 17:22, George Bosilca wrote:
> I was hopping my patch solve the problem completely ... look like
> it's not the case :( How exactly you get the dead-lock in the
> mpi_test_suite ? Which configure options ? Only --enable-progress-
> threads ?
This happens with --enable-progress-threads alone, as well as with an additional 
--enable-mpi-threads

Both hang in the same places:
Process 0:
#4  0x40315a56 in pthread_cond_wait@@GLIBC_2.3.2 () 
from /lib/tls/libpthread.so.0
#5  0x40222513 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#6  0x4007d7a2 in opal_condition_wait (c=0x4013c6c0, m=0x4013c720) at 
condition.h:64
#7  0x4007d40b in ompi_request_wait_all (count=1, requests=0x80bc1c0, 
statuses=0x0) at req_wait.c:159
#8  0x4073083f in ompi_coll_tuned_bcast_intra_basic_linear (buff=0x80c9c90, 
count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at 
coll_tuned_bcast.c:762
#9  0x4072b002 in ompi_coll_tuned_bcast_intra_dec_fixed (buff=0x80c9c90, 
count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at 
coll_tuned_decision_fixed.c:175
#10 0x40083dae in PMPI_Bcast (buffer=0x80c9c90, count=1000, 
datatype=0x8061de8, root=0, comm=0x80627e0) at pbcast.c:88
#11 0x0804f2cf in tst_coll_bcast_run (env=0xbfffeac0) at tst_coll_bcast.c:74
#12 0x0804bf21 in tst_test_run_func (env=0xbfffeac0) at tst_tests.c:377
#13 0x0804a46a in main (argc=7, argv=0xbfffeb74) at mpi_test_suite.c:319


Process 1:
#4  0x40315a56 in pthread_cond_wait@@GLIBC_2.3.2 () 
from /lib/tls/libpthread.so.0
#5  0x40222513 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#6  0x406941e3 in opal_condition_wait (c=0x4013c6c0, m=0x4013c720) at 
condition.h:64
#7  0x406939f2 in mca_pml_ob1_recv (addr=0x80c9c58, count=1000, 
datatype=0x8061de8, src=0, tag=-17, comm=0x80627e0, status=0x0) at 
pml_ob1_irecv.c:96
#8  0x407307a4 in ompi_coll_tuned_bcast_intra_basic_linear (buff=0x80c9c58, 
count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at 
coll_tuned_bcast.c:729
#9  0x4072b002 in ompi_coll_tuned_bcast_intra_dec_fixed (buff=0x80c9c58, 
count=1000, datatype=0x8061de8, root=0, comm=0x80627e0) at 
coll_tuned_decision_fixed.c:175
#10 0x40083dae in PMPI_Bcast (buffer=0x80c9c58, count=1000, 
datatype=0x8061de8, root=0, comm=0x80627e0) at pbcast.c:88
#11 0x0804f2cf in tst_coll_bcast_run (env=0xbfffeac0) at tst_coll_bcast.c:74
#12 0x0804bf21 in tst_test_run_func (env=0xbfffeac0) at tst_tests.c:377
#13 0x0804a46a in main (argc=7, argv=0xbfffeb74) at mpi_test_suite.c:319



And yes, when I run with the basic coll component, it also hangs ;-]

mpirun -np 2 --mca coll basic ./mpi_test_suite -r FULL -c MPI_COMM_WORLD -d 
MPI_INT

#4  0x40315a56 in pthread_cond_wait@@GLIBC_2.3.2 () 
from /lib/tls/libpthread.so.0
#5  0x40222513 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/tls/libc.so.6
#6  0x406941e3 in opal_condition_wait (c=0x4013c6c0, m=0x4013c720) at 
condition.h:64
#7  0x406939f2 in mca_pml_ob1_recv (addr=0x80c4ca0, count=1000, 
datatype=0x8061de8, src=0, tag=-17, comm=0x80627e0, status=0x0) at 
pml_ob1_irecv.c:96
#8  0x4070e402 in mca_coll_basic_bcast_lin_intra (buff=0x80c4ca0, count=1000, 
datatype=0x8061de8, root=0, comm=0x80627e0) at coll_basic_bcast.c:57
#9  0x40083dae in PMPI_Bcast (buffer=0x80c4ca0, count=1000, 
datatype=0x8061de8, root=0, comm=0x80627e0) at pbcast.c:88
#10 0x0804f2cf in tst_coll_bcast_run (env=0xbfffeab0) at tst_coll_bcast.c:74
#11 0x0804bf21 in tst_test_run_func (env=0xbfffeab0) at tst_tests.c:377
#12 0x0804a46a in main (argc=7, argv=0xbfffeb64) at mpi_test_suite.c:319


Now, for what it's worth, I ran under helgrind to check for possible 
race conditions, and it spews out:
==20240== Possible data race writing variable at 0x1D84F46C
==20240==    at 0x1DA8BE61: mca_oob_tcp_recv (oob_tcp_recv.c:129)
==20240==    by 0x1D73A636: mca_oob_recv_packed (oob_base_recv.c:69)
==20240==    by 0x1D73B2B0: mca_oob_xcast (oob_base_xcast.c:133)
==20240==    by 0x1D511138: ompi_mpi_init (ompi_mpi_init.c:421)
==20240==  Address 0x1D84F46C is 1020 bytes inside a block of size 3168 
alloc'd by thread 1
==20240==    at 0x1D4A80B4: malloc 
(in /usr/lib/valgrind/vgpreload_helgrind.so)
==20240==    by 0x1D7DF7BE: opal_free_list_grow (opal_free_list.c:94)
==20240==    by 0x1D7DF754: opal_free_list_init (opal_free_list.c:79)
==20240==    by 0x1DA815E3: mca_oob_tcp_component_init (oob_tcp.c:530)


So this was my initial look into whether we may have races in 
opal/mpi_free_list....

CU,
Rainer
-- 
---------------------------------------------------------------------
Dipl.-Inf. Rainer Keller       email: kel...@hlrs.de
  High Performance Computing     Tel: ++49 (0)711-685 5858
    Center Stuttgart (HLRS)        Fax: ++49 (0)711-685 5832
  POSTAL:Nobelstrasse 19             http://www.hlrs.de/people/keller
  ACTUAL:Allmandring 30, R. O.030      AIM:rusraink
  70550 Stuttgart
