Tracking this down has reminded me of all the reasons why I despise the rankfile mapper... :-/
I have created a fix for this mess and will submit it for inclusion in 1.4. Thanks - not your fault, so pardon the comments. I've just had my fill of this particular code, since its creators no longer support it.

Ralph

On Mar 1, 2010, at 9:15 AM, Bogdan Costescu wrote:

> On Sat, Feb 27, 2010 at 7:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> I can't seem to replicate this first problem - it runs fine for me even if
>> the rankfile contains only one entry.
>
> First of all, thanks for taking a look at this!
>
> For me it's repeatable. Please note that I do specify '-np 4' even
> when there is only one entry in the ranks file; I've just checked that
> this happens with some random value given to -np. The only time I
> don't get a segv is with '-np 1', in which case I get the 'PAFFINITY
> cannot get physical core id...' error message. However, with other
> combinations, like 2 entries in the ranks file and '-np 4', the segv
> doesn't appear, only the error message. Anyway, for the original case
> (one entry in the ranks file and '-np 4'), the output obtained with the
> suggested debug is:
>
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [load_balance]
> [mbm-01-24:24102] mca:base:select:(rmaps) Skipping component [load_balance]. Query failed to return a module
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [rank_file]
> [mbm-01-24:24102] mca:base:select:(rmaps) Query of component [rank_file] set priority to 100
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [round_robin]
> [mbm-01-24:24102] mca:base:select:(rmaps) Query of component [round_robin] set priority to 70
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [seq]
> [mbm-01-24:24102] mca:base:select:(rmaps) Query of component [seq] set priority to 0
> [mbm-01-24:24102] mca:base:select:(rmaps) Selected component [rank_file]
> [mbm-01-24:24102] procdir: /tmp/openmpi-sessions-bq_bcostescu@mbm-01-24_0/36756/0/0
> [mbm-01-24:24102] jobdir: /tmp/openmpi-sessions-bq_bcostescu@mbm-01-24_0/36756/0
> [mbm-01-24:24102] top: openmpi-sessions-bq_bcostescu@mbm-01-24_0
> [mbm-01-24:24102] tmp: /tmp
> [mbm-01-24:24102] mpirun: reset PATH: /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin:/usr/local/bin:/bin:/usr/bin:/home/bq_bcostescu/bin
> [mbm-01-24:24102] mpirun: reset LD_LIBRARY_PATH: /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib
> [mbm-01-24:24102] [[36756,0],0] hostfile: checking hostfile hosts for nodes
> [mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile hosts
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: adding node mbm-01-24 to map
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile hosts
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:compute_usage
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons existing daemon [[36756,0],0] already launched
> [mbm-01-24:24102] *** Process received signal ***
> [mbm-01-24:24102] Signal: Segmentation fault (11)
> [mbm-01-24:24102] Signal code: Address not mapped (1)
> [mbm-01-24:24102] Failing at address: 0x70
> [mbm-01-24:24102] [ 0] /lib64/libpthread.so.0 [0x2b04e8c727c0]
> [mbm-01-24:24102] [ 1] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_util_encode_pidmap+0x140) [0x2b04e7c5b312]
> [mbm-01-24:24102] [ 2] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0xb31) [0x2b04e7c89557]
> [mbm-01-24:24102] [ 3] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x1f6) [0x2b04e7ca9210]
> [mbm-01-24:24102] [ 4] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0 [0x2b04e7cb3f2f]
> [mbm-01-24:24102] [ 5] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x403d3b]
> [mbm-01-24:24102] [ 6] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402ee4]
> [mbm-01-24:24102] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b04e8e9c994]
> [mbm-01-24:24102] [ 8] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402e09]
> [mbm-01-24:24102] *** End of error message ***
> Segmentation fault
>
> After applying r21728 by hand to the original 1.4.1, I can start the
> job properly as expected and the 'PAFFINITY cannot get physical core
> id...' error message doesn't appear anymore, so I'd like to ask for it
> to be applied to the 1.4 series. With this, I've tested the following
> combinations:
>
> entries in ranks file   -np   result
>          1               1    OK
>          1               2    segv
>          1               4    segv
>          2               1    OK
>          2               2    OK
>          2               4    OK
>          4               4    OK
>
> So the segv's really only appear when there's only one entry in the
> ranks file; if I'm the only one able to reproduce these segv's, I'd
> be happy to look into it with some guidance about the actual source
> code location...
>
> Cheers,
> Bogdan
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel
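
For anyone wanting to try the failing case Bogdan describes (a single rankfile entry combined with '-np 4'), a minimal setup might look something like the following; the hostfile and rankfile names, their contents, and the test program are assumptions on my part, since the exact files were not posted in the thread:

    $ cat hosts
    mbm-01-24 slots=4

    $ cat rankfile
    rank 0=mbm-01-24 slot=0     # single entry; with -np > 1 this is the case that segfaults

    $ mpirun -np 4 -hostfile hosts -rf rankfile ./hello

With two or more entries in the rankfile (e.g. adding 'rank 1=mbm-01-24 slot=1'), only the 'PAFFINITY cannot get physical core id...' message is reported rather than the segv, matching the table above.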