The error you are seeing usually indicates that some code is operating on memory that isn't aligned properly for the SPARC instruction being used. The failing address is odd-aligned, which is more than likely the culprit. If you have a core dump and can disassemble the code that was running at the time, it will probably turn out to be an instruction with an alignment requirement. If the MPI you are using is one you built yourself, can you try rebuilding OMPI with -g and getting the line number in the PML where it is failing?
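
To illustrate, here is a minimal sketch of the kind of access that raises exactly this signal on SPARC (my own example, not code from your run): a multi-byte store through an odd-aligned pointer.

#include <stdint.h>

int main(void)
{
    /* Force a known alignment so buf + 1 is genuinely odd (gcc attribute). */
    char buf[16] __attribute__((aligned(4)));
    /* SPARC requires a 4-byte-aligned address for a 32-bit store, so this
     * traps with SIGBUS, si_code BUS_ADRALN ("Invalid address alignment"),
     * the same signal/code pair shown in your trace.  volatile keeps the
     * compiler from optimizing the store away. */
    volatile uint32_t *p = (volatile uint32_t *)(buf + 1);
    *p = 42;
    return 0;
}

If you still have the core file, "gdb <binary> <core>" followed by "x/i $pc" should show the faulting load or store instruction; from there the alignment requirement is usually obvious.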

I haven't seen this type of error for some time, but I do all of my SPARC testing on Solaris with the Solaris Studio compilers. You may want to try compiling the benchmark with "-m32" to see if that helps, though since the failing address is odd I suspect it might not. If you can use the Studio compilers, you could try building the benchmark with the -xmemalign=8i option and see if that resolves the issue. That would help confirm the problem is just the alignment of data we are slicing and dicing, as opposed to wrongly addressing memory.
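
Concretely, the experiments I have in mind look roughly like this (the source file name is illustrative; the OSU benchmarks normally build through the mpicc wrapper):

$ mpicc -m32 -o osu_latency osu_latency.c           # force a 32-bit binary
$ mpicc -xmemalign=8i -o osu_latency osu_latency.c  # Studio compilers only

With the Studio compilers, -xmemalign=8i means "assume at most 8-byte alignment and generate code that fixes up misaligned accesses at run time" rather than letting them trap. And for the -g rebuild of OMPI mentioned above, something along the lines of:

$ ./configure CFLAGS=-g ...
$ make all install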

--td

On 11/21/2011 8:51 PM, Lukas Razik wrote:
Hello everybody!

I have Sun T5120 (SPARC64) servers with
- Debian: 6.0.3
- linux-2.6.39.4 (from kernel.org)
- OFED-1.5.3.2
- InfiniBand: Mellanox Technologies MT25418 [ConnectX VPI PCIe 2.0 2.5GT/s - IB DDR / 10GigE] (rev a0) with the newest firmware (2.9.1)
and the following issue:

If I try to mpirun a program like the osu_latency benchmark:
$ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun -np 2 --mca btl_base_verbose 50 --mca btl_openib_verbose 1 -host cluster1,cluster2 /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency

then I get these errors:
<snip>
# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
[cluster1:64027] *** Process received signal ***
[cluster1:64027] Signal: Bus error (10)
[cluster1:64027] Signal code: Invalid address alignment (1)
[cluster1:64027] Failing at address: 0xaa9053
[cluster1:64027] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xfffff8010209e2f0]
[cluster1:64027] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xfffff801031ce904]
[cluster1:64027] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xfffff801031d7498]
[cluster1:64027] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xfffff8010005a97c]
[cluster1:64027] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
[cluster1:64027] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100ac1240]
[cluster1:64027] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
[cluster1:64027] *** End of error message ***
[cluster2:02759] *** Process received signal ***
[cluster2:02759] Signal: Bus error (10)
[cluster2:02759] Signal code: Invalid address alignment (1)
[cluster2:02759] Failing at address: 0xaa9053
[cluster2:02759] [ 0] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_pml_ob1.so(+0x62f0) [0xfffff8010209e2f0]
[cluster2:02759] [ 1] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0x2904) [0xfffff801031ce904]
[cluster2:02759] [ 2] /usr/mpi/gcc/openmpi-1.4.3/lib64/openmpi/mca_coll_tuned.so(+0xb498) [0xfffff801031d7498]
[cluster2:02759] [ 3] /usr/mpi/gcc/openmpi-1.4.3/lib64/libmpi.so.0(MPI_Barrier+0xbc) [0xfffff8010005a97c]
[cluster2:02759] [ 4] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(main+0x2b0) [0x100f34]
[cluster2:02759] [ 5] /lib64/libc.so.6(__libc_start_main+0x100) [0xfffff80100ac1240]
[cluster2:02759] [ 6] /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency(_start+0x2c) [0x100bac]
[cluster2:02759] *** End of error message ***
---

The whole output can be found here:
http://net.razik.de/linux/T5120/openmpi-1.4.3-verbose.txt

Here is my 'ompi_info --param all all' output:
http://net.razik.de/linux/T5120/openmpi-1.4.3_param_all_all.txt

I get the same error with OFED-1.5.4-rc4, and also with openmpi-1.4.4.

If I disable openib, then I get correct results:
$ /usr/mpi/gcc/openmpi-1.4.3/bin/mpirun --mca btl ^openib -np 2 -host cluster1,cluster2 /usr/mpi/gcc/openmpi-1.4.3/tests/osu_benchmarks-3.1.1/osu_latency
# OSU MPI Latency Test v3.1.1
# Size            Latency (us)
0                       143.53
1                       140.50
<snip>
---

ibverbs seems to work:
$ ibv_srq_pingpong -n 1000000 cluster2
<snip>
8192000000 bytes in 4.15 seconds = 15806.63 Mbit/sec
1000000 iters in 4.15 seconds = 4.15 usec/iter
---

These are the installed OFED packages:
kernel-ib
ofed-scripts
libibverbs
libibverbs-devel
libibverbs-utils
libmlx4
libmlx4-devel
libibumad
libibumad-devel
libibmad
libibmad-devel
librdmacm
librdmacm-utils
librdmacm-devel
opensm-libs
ibutils
infiniband-diags
qperf
ofed-docs
mpi-selector
openmpi_gcc
mpitests_openmpi_gcc
---

I don't know which mailing list is the right one for this issue, and I'm very thankful for any help!
If you have questions, please ask!

Best regards,
Lukas


The archives of the lists I've sent this email to:
http://lists.openfabrics.org/pipermail/ewg/2011-November/thread.html
http://www.open-mpi.org/community/lists/devel/2011/11/date.php
http://thread.gmane.org/gmane.linux.drivers.rdma/


--
Terry D. Dontje | Principal Software Engineer
Developer Tools Engineering | +1.781.442.2631
Oracle - Performance Technologies
95 Network Drive, Burlington, MA 01803
Email terry.don...@oracle.com


