It seems like we have two bugs here:
1. After committing NUMA awareness we see a segfault.
2. Before committing NUMA (r18656) we see application hangs.
3. I checked it both with and without sendi (disabled as sketched below); same results.
4. It hangs most of the time, but sometimes large messages ( >1M ) work.
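
For reference, disabling sendi was done along the lines of George's suggestion quoted below: in btl_sm.c, the send-immediate entry of the sm module structure is set to NULL. This is only a minimal sketch of the kind of edit; the real initializer in the tree is positional and has many more fields, so the exact layout in your checkout may differ:

    /* btl_sm.c -- illustrative fragment, not the literal file contents.
     * The only point is to replace the sendi entry with NULL so the
     * send-immediate path is never taken. */
    mca_btl_sm_t mca_btl_sm = {
        {
            /* ... other mca_btl_base_module_t fields ... */
            mca_btl_sm_send,      /* btl_send */
            NULL,                 /* btl_sendi -- was mca_btl_sm_sendi */
            /* ... remaining fields ... */
        }
    };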


I will keep investigating :)


VER=TRUNK; //home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER}
/opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ;
/home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w
./mpi_p_${VER} -t bw -s 4000000
[witch17:09798] *** Process received signal ***
[witch17:09798] Signal: Segmentation fault (11)
[witch17:09798] Signal code: Address not mapped (1)
[witch17:09798] Failing at address: (nil)
[witch17:09798] [ 0] /lib64/libpthread.so.0 [0x2b1d13530c10]
[witch17:09798] [ 1]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_btl_sm.so [0x2b1d1557a68a]
[witch17:09798] [ 2]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/openmpi/mca_bml_r2.so [0x2b1d14e1b12f]
[witch17:09798] [ 3]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libopen-pal.so.0(opal_progress+0x5a)
[0x2b1d12f6a6da]
[witch17:09798] [ 4] /home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0
[0x2b1d12cafd28]
[witch17:09798] [ 5]
/home/USERS/lenny/OMPI_ORTE_TRUNK/lib/libmpi.so.0(PMPI_Waitall+0x91)
[0x2b1d12cd9d71]
[witch17:09798] [ 6] ./mpi_p_TRUNK(main+0xd32) [0x401ca2]
[witch17:09798] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4)
[0x2b1d13657154]
[witch17:09798] [ 8] ./mpi_p_TRUNK [0x400ea9]
[witch17:09798] *** End of error message ***
[witch1:24955]
--------------------------------------------------------------------------
mpirun noticed that process rank 62 with PID 9798 on node witch17 exited on
signal 11 (Segmentation fault).
--------------------------------------------------------------------------
witch1:/home/USERS/lenny/TESTS/NUMA #
witch1:/home/USERS/lenny/TESTS/NUMA #
witch1:/home/USERS/lenny/TESTS/NUMA #
witch1:/home/USERS/lenny/TESTS/NUMA # VER=18551;
//home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpicc -o mpi_p_${VER}
/opt/vltmpi/OPENIB/mpi/examples/mpi_p.c ;
/home/USERS/lenny/OMPI_ORTE_${VER}/bin/mpirun -np 100 -hostfile hostfile_w
./mpi_p_${VER} -t bw -s 4000000
BW (100) (size min max avg)  4000000    654.496755      2121.899985      1156.171067
witch1:/home/USERS/lenny/TESTS/NUMA #
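
For context, the mpi_p bandwidth test described in my quoted message below follows roughly the pattern sketched here. This is only a minimal sketch of a pairwise bandwidth loop under my assumptions, not the actual /opt/vltmpi/OPENIB/mpi/examples/mpi_p.c; the relevant point is that each rank completes non-blocking sends/receives with MPI_Waitall, which is where the trunk build crashes in the backtrace above.

    /* bw_sketch.c -- illustrative pairwise bandwidth pattern, NOT mpi_p.c.
     * Build/run (hypothetical): mpicc -o bw_sketch bw_sketch.c
     *                           mpirun -np <even N> ./bw_sketch 4000000   */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank, size, peer, i;
        int msg = (argc > 1) ? atoi(argv[1]) : 100000;  /* message size in bytes */
        const int iters = 20;
        char *sbuf, *rbuf;
        MPI_Request req[2];
        double t0, t1;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        peer = rank ^ 1;                 /* pair up even/odd ranks */
        sbuf = malloc(msg);
        rbuf = malloc(msg);
        memset(sbuf, 1, msg);

        MPI_Barrier(MPI_COMM_WORLD);
        t0 = MPI_Wtime();
        if (peer < size) {               /* an odd last rank just sits out */
            for (i = 0; i < iters; i++) {
                MPI_Irecv(rbuf, msg, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
                MPI_Isend(sbuf, msg, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);  /* crash site in the trace above */
            }
        }
        t1 = MPI_Wtime();

        if (rank == 0)
            printf("size %d  approx BW %.2f MB/s\n",
                   msg, (double)msg * iters / (t1 - t0) / 1e6);

        free(sbuf);
        free(rbuf);
        MPI_Finalize();
        return 0;
    }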




On Tue, Jun 17, 2008 at 2:10 PM, George Bosilca <bosi...@eecs.utk.edu>
wrote:

> Lenny,
>
> I guess you're running the latest version. If not, please update; Galen and
> I corrected some bugs last week. If you're using the latest (and
> greatest) then ... well, I imagine there is at least one bug left.
>
> There is a quick test you can do. In btl_sm.c, in the module structure
> at the beginning of the file, please replace the sendi function with NULL. If
> this fixes the problem, then at least we know that it's an sm send immediate
> problem.
>
>  Thanks,
>    george.
>
>
> On Jun 17, 2008, at 7:54 AM, Lenny Verkhovsky wrote:
>
> Hi, George,
>>
>> I have a problem running the BW benchmark on a 100-rank cluster after r18551.
>> The BW test is mpi_p, which runs mpi_bandwidth with 100K messages between all pairs.
>>
>>
>> #mpirun -np 100 -hostfile hostfile_w  ./mpi_p_18549 -t bw -s 100000
>> BW (100) (size min max avg)  100000     576.734030      2001.882416      1062.698408
>> #mpirun -np 100 -hostfile hostfile_w ./mpi_p_18551 -t bw -s 100000
>> mpirun: killing job...
>> (it hangs even after 10 hours).
>>
>>
>> It doesn't happen if I run with --bynode, or with btl openib,self only.
>>
>>
>> Lenny.
>>
>
>
