Karl --
Yikes. This looks like an alignment or memory-write-ordering error. I
have a dim recollection of making some fixes for this, but I am on a
plane at the moment and cannot check the SVN logs.
Could you try the latest 1.1.2 RC and see if the problem still occurs
for you? It's available on the general download page on the web site.
Thanks!
On Oct 7, 2006, at 7:34 PM, Karl Dockendorf wrote:
I just (yesterday) made the move from LAM/MPI to OpenMPI. The
configure / compile / install went smoothly (version 1.1.1).
However, after recompiling my source and running it, the program
usually crashes in MPI_INIT. The crash seems to come from the same
place MOST of the time, and it usually prints a message like this:
Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
Failing at addr:0xfdff8018
*** End of error message ***
Signal:10 info.si_errno:0(Unknown error: 0) si_code:1(BUS_ADRALN)
Failing at addr:0x2807000
*** End of error message ***
The test system (before moving back to the cluster) is a G4
PowerBook with OS 10.4.8 (not using Xgrid at the moment). I'm
oversubscribing it (2 processes, it knows there is only one).
Attached is the config info from the install. Listed below is what
seems to be the crash point, reached from the mca_bml_r2_progress function.
Any help is much appreciated.
Karl
CRASH 1:
Command: nm
Path: /Users/karl/programs/nm/build/Release/nm
Parent: orted [830]
Version: ??? (???)
PID: 834
Thread: 0
Exception: EXC_BAD_ACCESS (0x0001)
Codes: KERN_INVALID_ADDRESS (0x0001) at 0xfdff8018
Thread 0 Crashed:
0 mca_btl_sm.so 0x003abbec mca_btl_sm_component_progress + 3164
1 mca_bml_r2.so 0x003a0d38 mca_bml_r2_progress + 88
2 libopal.0.dylib 0x0032309c opal_progress + 236
3 mca_oob_tcp.so 0x00024f14 mca_oob_tcp_msg_wait + 52
4 mca_oob_tcp.so 0x0002a0a8 mca_oob_tcp_recv + 1128
5 liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
6 mca_gpr_proxy.so 0x00059bd4 orte_gpr_proxy_put + 804
7 liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
8 libmpi.0.dylib 0x00222d88 ompi_mpi_init + 1816
9 libmpi.0.dylib 0x00248b50 MPI_Init + 240
10 nm 0x00002e60 init_model + 48
11 nm 0x00002c70 main + 48
12 nm 0x00002494 _start + 340 (crt.c:272)
13 nm 0x0000233c start + 60
Thread 0 crashed with PPC Thread State 64:
srr0: 0x00000000003abbec srr1: 0x000000000200f930 vrsave: 0x0000000000000000
cr: 0x28004222 xer: 0x0000000000000004 lr: 0x00000000003aafa0 ctr: 0x00000000003aaf90
r0: 0x0000000000000000 r1: 0x00000000bfffe8d0 r2: 0x00000000fdff8000 r3: 0x0000000000000001
r4: 0x0000000000049814 r5: 0x00000000bfffe888 r6: 0x0000000000000000 r7: 0x00000000fdff8000
r8: 0x0000000000000004 r9: 0x00000000004177e0 r10: 0x0000000000000004 r11: 0x0000000000000000
r12: 0x00000000003aaf90 r13: 0x00000000fffffffe r14: 0x00000000003ad004 r15: 0x00000000003441e8
r16: 0x00000000003ad8c4 r17: 0x0000000000000004 r18: 0x0000000000000000 r19: 0x0000000000000000
r20: 0x0000000000000014 r21: 0x0000000000000000 r22: 0x00000000003ae0c4 r23: 0x0000000000000001
r24: 0x0000000000000000 r25: 0x0000000000000004 r26: 0x0000000000029c50 r27: 0x0000000000000000
r28: 0x0000000000000000 r29: 0x0000000000000001 r30: 0x0000000000000000 r31: 0x00000000003aafa0
CRASH 2:
Command: nm
Path: /Users/karl/programs/nm/build/Release/nm
Parent: orted [830]
Version: ??? (???)
PID: 832
Thread: 0
Exception: EXC_BAD_ACCESS (0x0001)
Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x00000000
Thread 0 Crashed:
0 <<00000000>> 0x00000000 0 + 0
1 mca_bml_r2.so 0x003a0d38 mca_bml_r2_progress + 88
2 libopal.0.dylib 0x0032309c opal_progress + 236
3 mca_oob_tcp.so 0x00024f14 mca_oob_tcp_msg_wait + 52
4 mca_oob_tcp.so 0x0002a0a8 mca_oob_tcp_recv + 1128
5 liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
6 mca_gpr_proxy.so 0x00059bd4 orte_gpr_proxy_put + 804
7 liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
8 libmpi.0.dylib 0x00222d88 ompi_mpi_init + 1816
9 libmpi.0.dylib 0x00248b50 MPI_Init + 240
10 nm 0x00002e60 init_model + 48
11 nm 0x00002c70 main + 48
12 nm 0x00002494 _start + 340 (crt.c:272)
13 nm 0x0000233c start + 60
Thread 0 crashed with PPC Thread State 64:
srr0: 0x0000000000000000 srr1: 0x000000004000d930 vrsave: 0x0000000000000000
cr: 0x28004222 xer: 0x0000000000000004 lr: 0x00000000003abe5c ctr: 0x0000000000000000
r0: 0x0000000000000000 r1: 0x00000000bfffe8d0 r2: 0x0000000002008000 r3: 0x00000000003ad864
r4: 0x0000000000000000 r5: 0x0000000002008000 r6: 0x0000000000000000 r7: 0x0000000002008000
r8: 0x00000000003ad8c4 r9: 0x00000000004177e0 r10: 0x0000000000000000 r11: 0x0000000000000000
r12: 0x0000000000000000 r13: 0x00000000fffffffe r14: 0x00000000003ad004 r15: 0x00000000003441e8
r16: 0x00000000003ad8c4 r17: 0x0000000000000000 r18: 0x0000000000000000 r19: 0x0000000000000000
r20: 0x0000000000000000 r21: 0x0000000000000000 r22: 0x00000000003ae0c4 r23: 0x00000000003441e8
r24: 0x0000000000000000 r25: 0x0000000002008000 r26: 0x00000000003ae0c4 r27: 0x0000000000000001
r28: 0x0000000000000004 r29: 0x0000000000000001 r30: 0x0000000000000000 r31: 0x00000000003aafa0
CRASH 3:
Command: nm
Path: /Users/karl/programs/nm/build/Debug/nm
Parent: orted [1790]
Version: ??? (???)
PID: 1794
Thread: 0
Exception: EXC_BAD_ACCESS (0x0001)
Codes: KERN_INVALID_ADDRESS (0x0001) at 0xfdff8018
Thread 0 Crashed:
0 mca_btl_sm.so 0x003bcbec mca_btl_sm_component_progress + 3164
1 mca_bml_r2.so 0x003b1d38 mca_bml_r2_progress + 88
2 libopal.0.dylib 0x0032309c opal_progress + 236
3 mca_oob_tcp.so 0x00055f14 mca_oob_tcp_msg_wait + 52
4 mca_oob_tcp.so 0x0005b0a8 mca_oob_tcp_recv + 1128
5 liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
6 mca_gpr_proxy.so 0x00068bd4 orte_gpr_proxy_put + 804
7 liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
8 libmpi.0.dylib 0x00222d88 ompi_mpi_init + 1816
9 libmpi.0.dylib 0x00248b50 MPI_Init + 240
10 nm 0x000028fc init_model + 80 (model.c:16)
11 nm 0x00002644 main + 72 (main.c:16)
12 nm 0x00001e54 _start + 340 (crt.c:272)
13 nm 0x00001cfc start + 60
Thread 0 crashed with PPC Thread State 64:
srr0: 0x00000000003bcbec srr1: 0x000000000200f930 vrsave: 0x0000000000000000
cr: 0x28004222 xer: 0x0000000000000004 lr: 0x00000000003bbfa0 ctr: 0x00000000003bbf90
r0: 0x0000000000000000 r1: 0x00000000bfffe8f0 r2: 0x00000000fdff8000 r3: 0x0000000000000001
r4: 0x0000000000049814 r5: 0x00000000bfffe8a8 r6: 0x0000000000000000 r7: 0x00000000fdff8000
r8: 0x0000000000000004 r9: 0x00000000004177d0 r10: 0x0000000000000004 r11: 0x0000000000000000
r12: 0x00000000003bbf90 r13: 0x00000000fffffffe r14: 0x00000000003be004 r15: 0x00000000003441e8
r16: 0x00000000003be8c4 r17: 0x0000000000000004 r18: 0x0000000000000000 r19: 0x0000000000000000
r20: 0x0000000000000014 r21: 0x0000000000000000 r22: 0x00000000003bf0c4 r23: 0x0000000000000001
r24: 0x0000000000000000 r25: 0x0000000000000004 r26: 0x000000000005ac50 r27: 0x0000000000000000
r28: 0x0000000000000000 r29: 0x0000000000000001 r30: 0x0000000000000000 r31: 0x00000000003bbfa0
CRASH 4:
Command: nm
Path: /Users/karl/programs/nm/build/Debug/nm
Parent: orted [1790]
Version: ??? (???)
PID: 1792
Thread: 0
Exception: EXC_BAD_ACCESS (0x0001)
Codes: KERN_PROTECTION_FAILURE (0x0002) at 0x00000000
Thread 0 Crashed:
0 <<00000000>> 0x00000000 0 + 0
1 mca_bml_r2.so 0x003b1d38 mca_bml_r2_progress + 88
2 libopal.0.dylib 0x0032309c opal_progress + 236
3 mca_oob_tcp.so 0x00055f14 mca_oob_tcp_msg_wait + 52
4 mca_oob_tcp.so 0x0005b0a8 mca_oob_tcp_recv + 1128
5 liborte.0.dylib 0x002f07b0 mca_oob_recv_packed + 80
6 mca_gpr_proxy.so 0x00068bd4 orte_gpr_proxy_put + 804
7 liborte.0.dylib 0x00304318 orte_soh_base_set_proc_soh + 968
8 libmpi.0.dylib 0x00222d88 ompi_mpi_init + 1816
9 libmpi.0.dylib 0x00248b50 MPI_Init + 240
10 nm 0x000028fc init_model + 80 (model.c:16)
11 nm 0x00002644 main + 72 (main.c:16)
12 nm 0x00001e54 _start + 340 (crt.c:272)
13 nm 0x00001cfc start + 60
Thread 0 crashed with PPC Thread State 64:
srr0: 0x0000000000000000 srr1: 0x000000004000d930 vrsave: 0x0000000000000000
cr: 0x28004222 xer: 0x0000000000000004 lr: 0x00000000003bce5c ctr: 0x0000000000000000
r0: 0x0000000000000000 r1: 0x00000000bfffe8f0 r2: 0x0000000002008000 r3: 0x00000000003be864
r4: 0x0000000000000000 r5: 0x0000000002008000 r6: 0x0000000000000000 r7: 0x0000000002008000
r8: 0x00000000003be8c4 r9: 0x00000000004177d0 r10: 0x0000000000000000 r11: 0x0000000000000000
r12: 0x0000000000000000 r13: 0x00000000fffffffe r14: 0x00000000003be004 r15: 0x00000000003441e8
r16: 0x00000000003be8c4 r17: 0x0000000000000000 r18: 0x0000000000000000 r19: 0x0000000000000000
r20: 0x0000000000000000 r21: 0x0000000000000000 r22: 0x00000000003bf0c4 r23: 0x00000000003441e8
r24: 0x0000000000000000 r25: 0x0000000002008000 r26: 0x00000000003bf0c4 r27: 0x0000000000000001
r28: 0x0000000000000004 r29: 0x0000000000000001 r30: 0x0000000000000000 r31: 0x00000000003bbfa0
<info.tar.gz>
--
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems