On Sun, Apr 27, 2008 at 07:00:57PM +0300, Lenny Verkhovsky wrote:
> Hi, all
>
> I faced the "Unbelievable situation"

The situation is believable, but commit r18274, which adds this output, is not, as it doesn't take sequence number wrap-around into account.
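To illustrate the wrap-around point, here is a minimal sketch of a wrap-aware comparison. It assumes the sequence number is a 16-bit unsigned counter (the "expected 65534" in the log suggests this, but it is an assumption), and the helper names seq_distance and seq_is_behind are hypothetical, not taken from the Open MPI source or from r18274.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not the Open MPI code. */

/* Signed distance from `expected` to `seq`, computed modulo 2^16.
 * Positive: seq is ahead of expected. Negative: seq is behind it. */
static int seq_distance(uint16_t seq, uint16_t expected)
{
    /* Subtract and reduce modulo 2^16 so the counter may wrap. */
    uint16_t d = (uint16_t)(seq - expected);
    /* Map the upper half of the range onto negative distances. */
    return d >= 0x8000u ? (int)d - 0x10000 : (int)d;
}

/* Nonzero if seq is behind expected once wrap-around is accounted for. */
static int seq_is_behind(uint16_t seq, uint16_t expected)
{
    return seq_distance(seq, expected) < 0;
}
```

With the first pair of numbers from the log, seq_distance(0, 65534) is 2: the counter goes 65534, 65535, 0, so fragment 0 is two ahead of the expected one and not a duplicate. A plain `seq < expected` test would wrongly classify it as old.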
> during running IMB benchmark.
>
> /home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode -hostfile hostfile_ompi -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing Sendrecv Exchange Allreduce Reduce Reduce_scatter Bcast Barrier
>
> #----------------------------------------------------------------
> # Benchmarking Allreduce
> # #processes = 96
> #----------------------------------------------------------------
> #Benchmarking  #procs  #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> Allreduce      96      0       1000          0.02         0.03         0.02
> Allreduce      96      4       1000          297.88       298.07       297.95
> Allreduce      96      8       1000          296.15       296.32       296.24
> Allreduce      96      16      1000          297.99       298.17       298.09
> Allreduce      96      32      1000          296.97       297.20       297.04
> Allreduce      96      64      1000          298.43       298.64       298.49
> Allreduce      96      128     1000          296.86       297.07       296.93
> Allreduce      96      256     1000          298.00       298.30       298.09
> Allreduce      96      512     1000          296.79       296.96       296.85
> Allreduce      96      1024    1000          299.23       299.39       299.31
> Allreduce      96      2048    1000          295.51       295.64       295.57
> Allreduce      96      4096    1000          246.02       246.13       246.08
> Allreduce      96      8192    1000          492.52       492.74       492.63
> Allreduce      96      16384   1000          5380.59      5381.47      5381.10
> Allreduce      96      32768   1000          5372.86      5373.69      5373.36
> Allreduce      96      65536   640           5470.41      5471.88      5471.16
> Allreduce      96      131072  320           5554.52      5556.82      5555.75
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 0 (expected 65534) from witch23
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 65116 (expected 65534) from witch23
> [witch24:15639] *** Process received signal ***
> [witch24:15639] Signal: Segmentation fault (11)
> [witch24:15639] Signal code: Address not mapped (1)
> [witch24:15639] Failing at address: 0x632457d0
> [witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]
> [witch24:15639] [ 1] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so [0x2b792aa47d34]
> [witch24:15639] [ 2] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so [0x2b792b172163]
> [witch24:15639] [ 3] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b0772]
> [witch24:15639] [ 4] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b15ff]
> [witch24:15639] [ 5] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so [0x2b792b38307f]
> [witch24:15639] [ 6] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2b79294cd16a]
> [witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0 [0x2b79292163a8]
> [witch24:15639] [ 8] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c077cb7]
> [witch24:15639] [ 9] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c07b296]
> [witch24:15639] [10] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7) [0x2b7929229907]
> [witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]
> [witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]
> [witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b7929bc2154]
> [witch24:15639] [14] ./IMB-MPI1 [0x4030a9]
> [witch24:15639] *** End of error message ***
> --------------------------------------------------------------------------
>
> Best Regards,
> Lenny.
>
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
Gleb.