On Sun, Apr 27, 2008 at 07:00:57PM +0300, Lenny Verkhovsky wrote:
> Hi, all 
> 
> I faced the "Unbelievable situation"
The situation is believable, but commit r18274, which adds this output,
is not: it doesn't take sequence number wraparound into account.
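
To illustrate the wraparound point, here is a minimal sketch (not the
actual ob1 code) of a comparison that stays correct when the sequence
counter wraps. It assumes 16-bit sequence numbers, which the expected
value 65534 in the quoted log suggests but does not prove, and the helper
names seq_is_before and seq_self_check are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper, not Open MPI's API. Assumes 16-bit sequence numbers.
 * Returns 1 if seq lies behind expected (an old or duplicated fragment),
 * 0 if it equals expected or lies ahead of it (possibly after a wrap).
 * The distance is taken modulo 2^16, and the upper half of the range
 * is interpreted as "behind".
 */
static int seq_is_before(uint16_t seq, uint16_t expected)
{
    uint16_t distance = (uint16_t)(seq - expected);
    return distance >= 0x8000u;
}

/* A naive test such as (seq < expected) would report 0 as older than 65534. */
static void seq_self_check(void)
{
    assert(!seq_is_before(0, 65534));     /* 0 is 2 ahead after the wrap */
    assert(!seq_is_before(65534, 65534)); /* equal is not behind */
    assert(seq_is_before(65533, 65534));  /* 1 behind */
    assert(seq_is_before(65116, 65534));  /* 418 behind */
}
```

With the first pair of values from the quoted log, a plain seq < expected
check flags sequence number 0 as a duplicate when 65534 is expected, while
the modular check treats it as two fragments ahead.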

> 
> during running IMB benchmark.
>
> /home/USERS/lenny/OMPI_ORTE_LMC/bin/mpirun -np 96 --bynode  -hostfile
> hostfile_ompi -mca btl_openib_max_lmc 1 ./IMB-MPI1 PingPong PingPing
> Sendrecv Exchange Allreduce Reduce Reduce_scatter Bcast Barrier
>
> #----------------------------------------------------------------
> # Benchmarking Allreduce
> # #processes = 96
> #----------------------------------------------------------------
> #Benchmarking  #procs   #bytes  #repetitions  t_min[usec]  t_max[usec]  t_avg[usec]
> Allreduce          96        0          1000         0.02         0.03         0.02
> Allreduce          96        4          1000       297.88       298.07       297.95
> Allreduce          96        8          1000       296.15       296.32       296.24
> Allreduce          96       16          1000       297.99       298.17       298.09
> Allreduce          96       32          1000       296.97       297.20       297.04
> Allreduce          96       64          1000       298.43       298.64       298.49
> Allreduce          96      128          1000       296.86       297.07       296.93
> Allreduce          96      256          1000       298.00       298.30       298.09
> Allreduce          96      512          1000       296.79       296.96       296.85
> Allreduce          96     1024          1000       299.23       299.39       299.31
> Allreduce          96     2048          1000       295.51       295.64       295.57
> Allreduce          96     4096          1000       246.02       246.13       246.08
> Allreduce          96     8192          1000       492.52       492.74       492.63
> Allreduce          96    16384          1000      5380.59      5381.47      5381.10
> Allreduce          96    32768          1000      5372.86      5373.69      5373.36
> Allreduce          96    65536           640      5470.41      5471.88      5471.16
> Allreduce          96   131072           320      5554.52      5556.82      5555.75
> 
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 0 (expected 65534) from witch23
> [witch24:15639] Unbelievable situation ... we got a duplicated fragment with seq number of 65116 (expected 65534) from witch23
> [witch24:15639] *** Process received signal ***
> [witch24:15639] Signal: Segmentation fault (11)
> [witch24:15639] Signal code: Address not mapped (1)
> [witch24:15639] Failing at address: 0x632457d0
> [witch24:15639] [ 0] /lib64/libpthread.so.0 [0x2b7929a9bc10]
> [witch24:15639] [ 1] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_allocator_bucket.so [0x2b792aa47d34]
> [witch24:15639] [ 2] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_pml_ob1.so [0x2b792b172163]
> [witch24:15639] [ 3] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b0772]
> [witch24:15639] [ 4] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_btl_openib.so [0x2b792b6b15ff]
> [witch24:15639] [ 5] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_bml_r2.so [0x2b792b38307f]
> [witch24:15639] [ 6] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libopen-pal.so.0(opal_progress+0x4a) [0x2b79294cd16a]
> [witch24:15639] [ 7] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0 [0x2b79292163a8]
> [witch24:15639] [ 8] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c077cb7]
> [witch24:15639] [ 9] /home/USERS/lenny/OMPI_ORTE_LMC/lib/openmpi/mca_coll_tuned.so [0x2b792c07b296]
> [witch24:15639] [10] /home/USERS/lenny/OMPI_ORTE_LMC/lib/libmpi.so.0(PMPI_Allreduce+0x1e7) [0x2b7929229907]
> [witch24:15639] [11] ./IMB-MPI1(IMB_allreduce+0x8e) [0x40764e]
> [witch24:15639] [12] ./IMB-MPI1(main+0x3aa) [0x4034ea]
> [witch24:15639] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b7929bc2154]
> [witch24:15639] [14] ./IMB-MPI1 [0x4030a9]
> [witch24:15639] *** End of error message ***
> 
> --------------------------------------------------------------------------
>
> Best Regards,
> Lenny.

> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

--
                        Gleb.
