Hello,

I'm developing an mca_pls module that I intend to drop into a preexisting Open MPI build (in its lib/openmpi directory) so that orterun picks it up. orterun does call my module correctly, but then it crashes. To help isolate the issue, I separately recompiled the stock mca_pls_rsh module from an Open MPI source checkout and dropped that in instead, and it fails the same way. Any pointers?

To give an idea of what's going on, here's an example attempt to run on two local processors:

dauger$ orterun -mca pls rsh -mca pls_base_verbose 10 --debug-devel -- np 2 --host localhost "/Users/dauger/Documents/ompi-trunk/pingpong"
[Rotarran-X-5.local:04475] connect_uni: connection not allowed
[Rotarran-X-5.local:04475] mca: base: components_open: Looking for pls components
[Rotarran-X-5.local:04475] mca: base: components_open: distilling pls components
[Rotarran-X-5.local:04475] mca: base: components_open: including pls components
[Rotarran-X-5.local:04475] mca: base: components_open: rsh --> included
[Rotarran-X-5.local:04475] mca: base: components_open: opening pls components
[Rotarran-X-5.local:04475] mca: base: components_open: found loaded component rsh
[Rotarran-X-5.local:04475] mca: base: components_open: component rsh open function successful
[Rotarran-X-5.local:04475] orte:base:select: querying component rsh
[Rotarran-X-5.local:04475] [0,0,0] setting up session dir with
[Rotarran-X-5.local:04475]      universe default-universe-4475
[Rotarran-X-5.local:04475]      user dauger
[Rotarran-X-5.local:04475]      host Rotarran-X-5.local
[Rotarran-X-5.local:04475]      jobid 0
[Rotarran-X-5.local:04475]      procid 0
[Rotarran-X-5.local:04475] procdir: /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-//openmpi-sessions-dauger@Rotarran-X-5.local_0/default-universe-4475/0/0
[Rotarran-X-5.local:04475] jobdir: /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-//openmpi-sessions-dauger@Rotarran-X-5.local_0/default-universe-4475/0
[Rotarran-X-5.local:04475] unidir: /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-//openmpi-sessions-dauger@Rotarran-X-5.local_0/default-universe-4475
[Rotarran-X-5.local:04475] top: openmpi-sessions-dauger@Rotarran-X-5.local_0
[Rotarran-X-5.local:04475] tmp: /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-/
[Rotarran-X-5.local:04475] [0,0,0] contact_file /var/folders/oE/oENz6Cd+FTCWQbRGkntLLU+++TI/-Tmp-//openmpi-sessions-dauger@Rotarran-X-5.local_0/default-universe-4475/universe-setup.txt
[Rotarran-X-5.local:04475] [0,0,0] wrote setup file
[Rotarran-X-5:04475] *** Process received signal ***
[Rotarran-X-5:04475] Signal: Bus error (10)
[Rotarran-X-5:04475] Signal code:  (2)
[Rotarran-X-5:04475] Failing at address: 0x0
[ 1] [0xbffff828, 0x00000000] (-P-)
[ 2] (orterun + 0x457) [0xbffff8b8, 0x00001d07]
[ 3] (main + 0x18) [0xbffff8d8, 0x000018ae]
[ 4] (start + 0x36) [0xbffff8fc, 0x0000186a]
[ 5] [0x00000000, 0x0000000d] (FP-)
[Rotarran-X-5:04475] *** End of error message ***
Bus error

pingpong was compiled with the existing Open MPI, and it runs with the built-in rsh module, but not when I replace the pls_rsh module with a recompiled one. With printf calls added to the recompiled pls_rsh module's _open and _init, I can see each of those routines return without a problem, but _launch is never called before the crash.

I'm running Mac OS X 10.5.1, which ships with Open MPI in /usr, on a MacBook Pro with an Intel Core Duo. ("Rotarran X.5" is the name of the computer.) I first tried the 1.3.0 source code via svn, then went back to the 1.2.3 source code from Open MPI, but both gave the above bus error. Then, guessing that Apple had changed things, I tried Apple's copy of Open MPI 1.2.3 from opensource.apple.com, but that doesn't work either. I've also tried Apple's choice of ./configure options, to no avail. Other than debugging orterun itself, what else can I try?

Thanks in advance,
   Dean
