I guess it should be here, sorry.

/home/USERS/lenny/OMPI_ORTE_18850/bin/mpirun -np 2 -H witch2,witch3 ./IMB-MPI1_18850 PingPong

#---------------------------------------------------
#    Intel (R) MPI Benchmark Suite V3.0v modified by Voltaire, MPI-1 part
#---------------------------------------------------
# Date                  : Tue Jul 15 15:11:30 2008
# Machine               : x86_64
# System                : Linux
# Release               : 2.6.16.46-0.12-smp
# Version               : #1 SMP Thu May 17 14:00:09 UTC 2007
# MPI Version           : 2.0
# MPI Thread Environment: MPI_THREAD_SINGLE
#
# Minimum message length in bytes:   0
# Maximum message length in bytes:   67108864
#
# MPI_Datatype                   :   MPI_BYTE
# MPI_Datatype for reductions    :   MPI_FLOAT
# MPI_Op                         :   MPI_SUM
#
#
# List of Benchmarks to run:
# PingPong

[witch3:32461] *** Process received signal ***
[witch3:32461] Signal: Segmentation fault (11)
[witch3:32461] Signal code: Address not mapped (1)
[witch3:32461] Failing at address: 0x20
[witch3:32461] [ 0] /lib64/libpthread.so.0 [0x2b514fcedc10]
[witch3:32461] [ 1] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510b416a]
[witch3:32461] [ 2] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510b4661]
[witch3:32461] [ 3] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510b180e]
[witch3:32461] [ 4] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_btl_openib.so [0x2b5151811c22]
[witch3:32461] [ 5] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_btl_openib.so [0x2b51518132e9]
[witch3:32461] [ 6] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_bml_r2.so [0x2b51512c412f]
[witch3:32461] [ 7] /home/USERS/lenny/OMPI_ORTE_18850/lib/libopen-pal.so.0(opal_progress+0x5a) [0x2b514f71268a]
[witch3:32461] [ 8] /home/USERS/lenny/OMPI_ORTE_18850/lib/openmpi/mca_pml_ob1.so [0x2b51510af0f5]
[witch3:32461] [ 9] /home/USERS/lenny/OMPI_ORTE_18850/lib/libmpi.so.0(PMPI_Recv+0x13b) [0x2b514f47941b]
[witch3:32461] [10] ./IMB-MPI1_18850(IMB_pingpong+0x1a1) [0x4073cd]
[witch3:32461] [11] ./IMB-MPI1_18850(IMB_warm_up+0x2d) [0x405e49]
[witch3:32461] [12] ./IMB-MPI1_18850(main+0x394) [0x4034d4]
[witch3:32461] [13] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b514fe14154]
[witch3:32461] [14] ./IMB-MPI1_18850 [0x4030a9]
[witch3:32461] *** End of error message ***
mpirun: killing job...
--------------------------------------------------------------------------
mpirun was unable to cleanly terminate the daemons on the nodes shown
below. Additional manual cleanup may be required - please refer to the
"orte-clean" tool for assistance.
--------------------------------------------------------------------------
        witch2
        witch3

(The crash is in PMPI_Recv, reached from IMB_warm_up via IMB_pingpong; a minimal sketch of that receive pattern appears after the thread below.)

On 7/15/08, Pavel Shamis (Pasha) <pa...@dev.mellanox.co.il> wrote:
>
>> It looks like a new issue to me, Pasha. Possibly a side consequence of
>> the IOF change made by Jeff and me the other day. From what I can see,
>> it looks like your app was a simple "hello" - correct?
>>
> Yep, it is a simple hello application.
>
>> If you look at the error, the problem occurs when mpirun is trying to
>> route a message. Since the app is clearly running at this time, the
>> problem is probably in the IOF. The error message shows that mpirun is
>> attempting to route a message to a jobid that doesn't exist. We have a
>> test in the RML that forces an "abort" if that occurs.
>>
>> I would guess that there is either a race condition or memory
>> corruption occurring somewhere, but I have no idea where.
>>
>> This may be the "new hole in the dyke" I cautioned about in earlier
>> notes regarding the IOF... :-)
>>
>> Still, given that this hits rarely, it is probably a more acceptable
>> bug to leave in the code than the one we just fixed (duplicated
>> stdin)...
>>
> It is not such a rare issue: 19 failures in my MTT run
> (http://www.open-mpi.org/mtt/index.php?do_redir=765).
>
> Pasha
>
>> Ralph
>>
>> On 7/14/08 1:11 AM, "Pavel Shamis (Pasha)" <pa...@dev.mellanox.co.il>
>> wrote:
>>
>>> Please see http://www.open-mpi.org/mtt/index.php?do_redir=764
>>>
>>> The error is not consistent. It takes a lot of iterations to
>>> reproduce it.
>>> In my MTT testing I have seen it a few times.
>>>
>>> Is it a known issue?
>>>
>>> Regards,
>>> Pasha
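
For reference, the backtrace above shows PMPI_Recv blocking in opal_progress while the openib BTL progress path invokes a pml_ob1 callback, and that callback dereferences a bad pointer (0x20). What follows is a minimal sketch of the ping-pong exchange that IMB PingPong exercises during warm-up. It is not the IMB source - just the MPI_Send/MPI_Recv round trip the crashing rank would be executing, with an arbitrary iteration count and the maximum message size taken from the benchmark header.

/* Minimal sketch of the ping-pong pattern IMB PingPong exercises.
 * Not the IMB source: just the send/recv round trip that rank 1
 * (the crashing side, blocked in MPI_Recv) would be running.
 * Iteration count is an arbitrary illustrative choice. */
#include <mpi.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;
    const int iters = 1000;       /* arbitrary */
    const int bytes = 67108864;   /* max message size from the header above */
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    char *buf = malloc(bytes);
    if (buf == NULL)
        MPI_Abort(MPI_COMM_WORLD, 1);

    for (int i = 0; i < iters; i++) {
        if (rank == 0) {
            MPI_Send(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD);
            MPI_Recv(buf, bytes, MPI_BYTE, 1, 0, MPI_COMM_WORLD, &status);
        } else if (rank == 1) {
            /* The segfault fires under this receive: PMPI_Recv spins in
             * opal_progress, the openib BTL delivers a fragment, and a
             * pml_ob1 callback touches the bad address. */
            MPI_Recv(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD, &status);
            MPI_Send(buf, bytes, MPI_BYTE, 0, 0, MPI_COMM_WORLD);
        }
    }

    free(buf);
    MPI_Finalize();
    return 0;
}

Built with mpicc and launched the same way as above (mpirun -np 2 -H witch2,witch3 ...), this pattern would drive the same openib receive path that faults in the backtrace.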
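Ralph's note about the RML test that forces an "abort" when a message targets a nonexistent jobid describes a fail-fast guard on the routing path. Below is a hypothetical sketch of that shape of check; the names (proc_name_t, route_table_lookup, rml_route) are invented for illustration and are not the actual ORTE/RML API.

/* Hypothetical sketch of a fail-fast routing guard of the kind Ralph
 * describes: before forwarding, verify the destination jobid is still
 * known, and abort loudly instead of routing into freed state.
 * All names here are invented for illustration. */
#include <stdio.h>
#include <stdlib.h>
#include <stdbool.h>
#include <stdint.h>

typedef struct { uint32_t jobid; uint32_t vpid; } proc_name_t;
typedef struct { proc_name_t dst; /* payload omitted */ } msg_t;

/* Stand-in for the runtime's job-tracking state: only jobid 1
 * (say, our own job) is known. */
static bool route_table_lookup(uint32_t jobid)
{
    static const uint32_t known_jobids[] = { 1 };
    for (size_t i = 0; i < sizeof(known_jobids) / sizeof(known_jobids[0]); i++)
        if (known_jobids[i] == jobid)
            return true;
    return false;
}

static void rml_route(const msg_t *msg)
{
    if (!route_table_lookup(msg->dst.jobid)) {
        /* The guard: a stale jobid here means some layer (e.g. the IOF)
         * handed us a message for a job we no longer track, so fail
         * fast rather than chase dangling pointers. */
        fprintf(stderr, "RML: no route to jobid %u - aborting\n",
                (unsigned) msg->dst.jobid);
        abort();
    }
    /* ... otherwise forward the message ... */
}

int main(void)
{
    msg_t m = { .dst = { .jobid = 42, .vpid = 0 } };
    rml_route(&m);   /* demonstrates the abort path */
    return 0;
}

The design point is fail-fast behavior: detecting the stale jobid at the routing layer converts a race or memory-corruption symptom into an immediate, attributable abort instead of a later, harder-to-diagnose crash.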