On Sat, Feb 27, 2010 at 7:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
> I can't seem to replicate this first problem - it runs fine for me even if
> the rankfile contains only one entry.
First of all, thanks for taking a look at this! For me it is repeatable. Please note that I do specify '-np 4' even when there is only one entry in the ranks file; I've just checked that this happens with arbitrary values given to -np. The only time I don't get a segv is with '-np 1', in which case I get the 'PAFFINITY cannot get physical core id...' error message. However, with other combinations, like two entries in the ranks file and '-np 4', the segv doesn't appear, only the error message.

Anyway, for the original case (one entry in the ranks file and '-np 4'), the output obtained with the suggested debug settings is:

[mbm-01-24:24102] mca:base:select:(rmaps) Querying component [load_balance]
[mbm-01-24:24102] mca:base:select:(rmaps) Skipping component [load_balance]. Query failed to return a module
[mbm-01-24:24102] mca:base:select:(rmaps) Querying component [rank_file]
[mbm-01-24:24102] mca:base:select:(rmaps) Query of component [rank_file] set priority to 100
[mbm-01-24:24102] mca:base:select:(rmaps) Querying component [round_robin]
[mbm-01-24:24102] mca:base:select:(rmaps) Query of component [round_robin] set priority to 70
[mbm-01-24:24102] mca:base:select:(rmaps) Querying component [seq]
[mbm-01-24:24102] mca:base:select:(rmaps) Query of component [seq] set priority to 0
[mbm-01-24:24102] mca:base:select:(rmaps) Selected component [rank_file]
[mbm-01-24:24102] procdir: /tmp/openmpi-sessions-bq_bcostescu@mbm-01-24_0/36756/0/0
[mbm-01-24:24102] jobdir: /tmp/openmpi-sessions-bq_bcostescu@mbm-01-24_0/36756/0
[mbm-01-24:24102] top: openmpi-sessions-bq_bcostescu@mbm-01-24_0
[mbm-01-24:24102] tmp: /tmp
[mbm-01-24:24102] mpirun: reset PATH: /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin:/usr/local/bin:/bin:/usr/bin:/home/bq_bcostescu/bin
[mbm-01-24:24102] mpirun: reset LD_LIBRARY_PATH: /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib
[mbm-01-24:24102] [[36756,0],0] hostfile: checking hostfile hosts for nodes
[mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile hosts
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base: adding node mbm-01-24 to map
[mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile hosts
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base:compute_usage
[mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons
[mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons existing daemon [[36756,0],0] already launched
[mbm-01-24:24102] *** Process received signal ***
[mbm-01-24:24102] Signal: Segmentation fault (11)
[mbm-01-24:24102] Signal code: Address not mapped (1)
[mbm-01-24:24102] Failing at address: 0x70
[mbm-01-24:24102] [ 0] /lib64/libpthread.so.0 [0x2b04e8c727c0]
[mbm-01-24:24102] [ 1] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_util_encode_pidmap+0x140) [0x2b04e7c5b312]
[mbm-01-24:24102] [ 2] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0xb31) [0x2b04e7c89557]
[mbm-01-24:24102] [ 3] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x1f6) [0x2b04e7ca9210]
[mbm-01-24:24102] [ 4] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0 [0x2b04e7cb3f2f]
[mbm-01-24:24102] [ 5] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x403d3b]
[mbm-01-24:24102] [ 6] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402ee4]
[mbm-01-24:24102] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b04e8e9c994]
[mbm-01-24:24102] [ 8] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402e09]
[mbm-01-24:24102] *** End of error message ***
Segmentation fault

After applying r21728 by hand to the original 1.4.1, the job starts properly as expected and the 'PAFFINITY cannot get physical core id...' error message no longer appears, so I'd like to ask for it to be applied to the 1.4 series.

With this in place, I've tested the following combinations:

entries in ranks file    -np    result
1                        1      OK
1                        2      segv
1                        4      segv
2                        1      OK
2                        2      OK
2                        4      OK
4                        4      OK

So the segvs really only appear when there is a single entry in the ranks file. If I'm the only one able to reproduce them, I'd be happy to look into it myself, given some guidance about the relevant source code location...

Cheers,
Bogdan
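
P.S. In case it helps with reproducing this, a minimal setup of the kind I'm describing would look roughly as follows; the exact rankfile/hostfile contents, slot number and test binary below are only placeholders, the point is just a single rankfile entry combined with a larger -np:

    $ cat rankfile
    rank 0=mbm-01-24 slot=0

    $ cat hosts
    mbm-01-24 slots=4

    $ mpirun -np 4 -hostfile hosts -rf rankfile ./hello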