I'm looking at the first problem - will get back to you on it. As to the second issue: it was r21728, and no - it does not appear to have been moved to the 1.4 series (the rankfile code isn't tested on a regular basis). I will do so now.
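In the meantime, a rankfile with one entry per requested rank should at least avoid the crash, since it only bites when the rankfile has fewer entries than ranks. A sketch for your node (untested here; it assumes the cores of the second socket are addressed socket-relative as 1:0 through 1:3, and it names them explicitly rather than using the '1:*' wildcard in case the problem turns out to be specific to the wildcard expansion):

rank 0=mbm-01-24 slot=1:0
rank 1=mbm-01-24 slot=1:1
rank 2=mbm-01-24 slot=1:2
rank 3=mbm-01-24 slot=1:3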
Thanks!
Ralph

On Feb 15, 2010, at 10:39 AM, Bogdan Costescu wrote:

> Hi!
>
> With version 1.4.1 I get a rather strange crash in mpirun whenever I
> try to run a job using (I think) a rankfile which doesn't contain the
> specified number of ranks. E.g. I ask for 4 ranks ('-np 4'), but the
> rankfile contains only one entry:
>
> rank 0=mbm-01-24 slot=1:*
>
> and the following comes out:
>
> [mbm-01-24:20985] *** Process received signal ***
> [mbm-01-24:20985] Signal: Segmentation fault (11)
> [mbm-01-24:20985] Signal code: Address not mapped (1)
> [mbm-01-24:20985] Failing at address: 0x50
> [mbm-01-24:20985] [ 0] /lib64/libpthread.so.0 [0x2b9de894f7c0]
> [mbm-01-24:20985] [ 1] /sw/openmpi/1.4.1/gcc/4.4.3/lib/libopen-rte.so.0(orte_util_encode_pidmap+0xbb) [0x2b9de79b9f7b]
> [mbm-01-24:20985] [ 2] /sw/openmpi/1.4.1/gcc/4.4.3/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0x2d0) [0x2b9de79d49c0]
> [mbm-01-24:20985] [ 3] /sw/openmpi/1.4.1/gcc/4.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0xbc) [0x2b9de79e1fcc]
> [mbm-01-24:20985] [ 4] /sw/openmpi/1.4.1/gcc/4.4.3/lib/libopen-rte.so.0 [0x2b9de79e6251]
> [mbm-01-24:20985] [ 5] mpirun [0x403782]
> [mbm-01-24:20985] [ 6] mpirun [0x402cb4]
> [mbm-01-24:20985] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b9de8b79994]
> [mbm-01-24:20985] [ 8] mpirun [0x402bd9]
> [mbm-01-24:20985] *** End of error message ***
> Segmentation fault
>
> However, if the rankfile contains a second entry, like:
>
> rank 0=mbm-01-24 slot=1:*
> rank 1=mbm-01-24 slot=1:*
>
> I get an error, but no segmentation fault. I guess that the
> segmentation fault is unintended... Is this known? If not, how could
> I debug this?
>
> Now to the second problem: the exact same error keeps coming even if I
> specify 4 ranks; the messages are:
>
> --------------------------------------------------------------------------
> mpirun was unable to start the specified application as it encountered an
> error:
>
> Error name: Error
> Node: mbm-01-24
>
> when attempting to start process rank 0.
> --------------------------------------------------------------------------
> [mbm-01-24:21011] Rank 0: PAFFINITY cannot get physical core id for
> logical core 4 in physical socket 1 (1)
> --------------------------------------------------------------------------
> We were unable to successfully process/set the requested processor
> affinity settings:
>
> Specified slot list: 1:*
> Error: Error
>
> This could mean that a non-existent processor was specified, or
> that the specification had improper syntax.
> --------------------------------------------------------------------------
>
> The node has 2 sockets, each with 4 cores, so what I'm trying to achieve
> is using the 4 cores of the second socket. When searching the archives,
> I stumbled on an e-mail from not too long ago which seemingly dealt
> with the same error:
>
> http://www.open-mpi.org/community/lists/devel/2009/07/6513.php
>
> which suggests that a fix was found, but no commit was specified, so I
> can't track down whether this was actually also applied to the stable
> series. Could someone more knowledgeable in this area shed some light?
>
> Thanks in advance!
> Bogdan
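P.S. One more datapoint that would help with the PAFFINITY message: the kernel's own view of how logical processors map onto sockets and cores. On Linux you can get it with something like the following (a sketch; the field names are standard in /proc/cpuinfo, though the exact layout varies by kernel):

# list each logical CPU with the socket ("physical id") and core ("core id") it sits on
grep -E '^processor|^physical id|^core id' /proc/cpuinfo

If the "core id" values within physical id 1 only run 0-3 while the error complains about logical core 4 in socket 1, that would suggest the slot-list code is mixing global and socket-relative core numbering when it expands '1:*'.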