It seems like we have two bugs here:

1. After committing NUMA awareness we see a segfault.
2. Before committing NUMA (r18656) we see application hangs.
3. I checked it both with and without sendi (see the sketch below), same results.
4. It hangs most of the time, but sometimes large messages (>1M) do work.
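For item 3, "without sendi" means roughly the change below in btl_sm.c. This is a sketch from memory, not copied from the tree; the exact field order of the sm module structure may differ on current trunk, so check before applying:

    /* btl_sm.c (sketch only, surrounding fields elided):
     * put NULL in the send-immediate slot of the sm module structure
     * so the BTL falls back to the regular send path. */
    mca_btl_sm_t mca_btl_sm = {
        {
            /* ... other mca_btl_base_module_t fields unchanged ... */
            mca_btl_sm_send,
            NULL,    /* was mca_btl_sm_sendi -- this is the "without sendi" case */
            /* ... remaining function pointers unchanged ... */
        }
    };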
I will keep investigating :)

VER=TRUNK; //home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER} /opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ; /home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w ./mpi_p_${VER} -t bw -s 4000000

[witch17:09798] *** Process received signal ***
[witch17:09798] Signal: Segmentation fault (11)
[witch17:09798] Signal code: Address not mapped (1)
[witch17:09798] Failing at address: (nil)
[witch17:09798] [ 0] /lib64/libpthread.so.0 [0x2b1d13530c10]
[witch17:09798] [ 1] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_btl_sm.so [0x2b1d1557a68a]
[witch17:09798] [ 2] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_bml_r2.so [0x2b1d14e1b12f]
[witch17:09798] [ 3] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libopen-pal.so.0(opal_progress+0x5a) [0x2b1d12f6a6da]
[witch17:09798] [ 4] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0 [0x2b1d12cafd28]
[witch17:09798] [ 5] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0(PMPI_Waitall+0x91) [0x2b1d12cd9d71]
[witch17:09798] [ 6] ./mpi_p_TRUNK(main+0xd32) [0x401ca2]
[witch17:09798] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b1d13657154]
[witch17:09798] [ 8] ./mpi_p_TRUNK [0x400ea9]
[witch17:09798] *** End of error message ***
[witch1:24955] --------------------------------------------------------------------------
mpirun noticed that process rank 62 with PID 9798 on node witch17 exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

witch1:/home/USERS/lenny/TESTS/NUMA # VER=18551; //home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER} /opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ; /home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w ./mpi_p_${VER} -t bw -s 4000000
BW (100) (size min max avg) 4000000 654.496755 2121.899985 1156.171067
witch1:/home/USERS/lenny/TESTS/NUMA #

On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca <bosi...@eecs.utk.edu> wrote:
> Lenny,
>
> I guess you're running the latest version. If not, please update, Galen and
> myself corrected some bugs last week. If you're using the latest (and
> greatest) then ... well I imagine there is at least one bug left.
>
> There is a quick test you can do. In the btl_sm.c in the module structure
> at the beginning of the file, please replace the sendi function by NULL. If
> this fix the problem, then at least we know that it's a sm send immediate
> problem.
>
> Thanks,
>   george.
>
>
> On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:
>
> Hi, George,
>>
>> I have a problem running BW benchmark on 100 rank cluster after r18551.
>> The BW is mpi_p that runs mpi_bandwidth with 100K between all pairs.
>>
>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18549 -t bw -s 100000
>> BW (100) (size min max avg) 100000 576.734030 2001.882416 1062.698408
>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 100000
>> mpirun: killing job...
>> ( it hangs even after 10 hours ).
>>
>> It doesn't happen if I run --bynode or btl openib,self only.
>>
>> Lenny.