Tracking this down has reminded me of all the reasons why I despise the 
rankfile mapper... :-/

I have created a fix for this mess and will submit it for inclusion in 1.4.

Thanks - not your fault, so pardon the comments. I've just had my fill of 
this particular code, since its creators no longer support it.
Ralph


On Mar 1, 2010, at 9:15 AM, Bogdan Costescu wrote:

> On Sat, Feb 27, 2010 at 7:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
>> I can't seem to replicate this first problem - it runs fine for me even if 
>> the rankfile contains only one entry.
> 
> First of all, thanks for taking a look at this !
> 
> For me it's repeatable. Note that I do specify '-np 4' even though the
> rankfile contains only one entry; I've just checked that this happens
> with any value given to -np. The only time I don't get a segv is with
> '-np 1', in which case I get the 'PAFFINITY cannot get physical core
> id...' error message instead. With other combinations, such as two
> entries in the rankfile and '-np 4', the segv doesn't appear, only the
> error message. For the original case (one entry in the rankfile and
> '-np 4'), the output obtained with the suggested debug settings is
> below; a minimal reproducer is sketched after the trace:
> 
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [load_balance]
> [mbm-01-24:24102] mca:base:select:(rmaps) Skipping component
> [load_balance]. Query failed to return a module
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [rank_file]
> [mbm-01-24:24102] mca:base:select:(rmaps) Query of component
> [rank_file] set priority to 100
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [round_robin]
> [mbm-01-24:24102] mca:base:select:(rmaps) Query of component
> [round_robin] set priority to 70
> [mbm-01-24:24102] mca:base:select:(rmaps) Querying component [seq]
> [mbm-01-24:24102] mca:base:select:(rmaps) Query of component [seq] set
> priority to 0
> [mbm-01-24:24102] mca:base:select:(rmaps) Selected component [rank_file]
> [mbm-01-24:24102] procdir:
> /tmp/openmpi-sessions-bq_bcostescu@mbm-01-24_0/36756/0/0
> [mbm-01-24:24102] jobdir: 
> /tmp/openmpi-sessions-bq_bcostescu@mbm-01-24_0/36756/0
> [mbm-01-24:24102] top: openmpi-sessions-bq_bcostescu@mbm-01-24_0
> [mbm-01-24:24102] tmp: /tmp
> [mbm-01-24:24102] mpirun: reset PATH:
> /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin:/usr/local/bin:/bin:/usr/bin:/home/bq_bcostescu/bin
> [mbm-01-24:24102] mpirun: reset LD_LIBRARY_PATH:
> /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib
> [mbm-01-24:24102] [[36756,0],0] hostfile: checking hostfile hosts for nodes
> [mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile 
> hosts
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new
> proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in
> job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: adding node mbm-01-24 to map
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job
> [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile 
> hosts
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new
> proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in
> job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job
> [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new
> proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in
> job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job
> [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new
> proc [[36756,1],INVALID]
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in
> job [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job
> [36756,1] to node mbm-01-24
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:compute_usage
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons
> [mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons existing
> daemon [[36756,0],0] already launched
> [mbm-01-24:24102] *** Process received signal ***
> [mbm-01-24:24102] Signal: Segmentation fault (11)
> [mbm-01-24:24102] Signal code: Address not mapped (1)
> [mbm-01-24:24102] Failing at address: 0x70
> [mbm-01-24:24102] [ 0] /lib64/libpthread.so.0 [0x2b04e8c727c0]
> [mbm-01-24:24102] [ 1]
> /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_util_encode_pidmap+0x140)
> [0x2b04e7c5b312]
> [mbm-01-24:24102] [ 2]
> /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0xb31)
> [0x2b04e7c89557]
> [mbm-01-24:24102] [ 3]
> /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x1f6)
> [0x2b04e7ca9210]
> [mbm-01-24:24102] [ 4]
> /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0
> [0x2b04e7cb3f2f]
> [mbm-01-24:24102] [ 5] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x403d3b]
> [mbm-01-24:24102] [ 6] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402ee4]
> [mbm-01-24:24102] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) 
> [0x2b04e8e9c994]
> [mbm-01-24:24102] [ 8] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402e09]
> [mbm-01-24:24102] *** End of error message ***
> Segmentation fault
> 
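> For reference, a minimal setup that triggers the trace above looks
> like this (the slot value and binary name are illustrative
> assumptions, adapt as needed):
> 
>   $ cat rankfile
>   rank 0=mbm-01-24 slot=0
>   $ mpirun -np 4 -rf rankfile ./a.out    # segv, as above
> 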
> After applying r21728 by hand to the original 1.4.1, I can start the
> job properly, and the 'PAFFINITY cannot get physical core id...'
> error message no longer appears, so I'd like to ask for that change
> to be applied to the 1.4 series. With it, I've tested the following
> combinations:
> 
> entries in rankfile    -np    result
> 1                       1     OK
> 1                       2     segv
> 1                       4     segv
> 2                       1     OK
> 2                       2     OK
> 2                       4     OK
> 4                       4     OK
> 
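> In case it's useful, that sweep can be scripted roughly like this (the
> hostname, slot numbering, and binary name are illustrative
> assumptions):
> 
>   for entries in 1 2 4; do
>     seq 0 $((entries - 1)) | \
>       awk '{ printf "rank %d=mbm-01-24 slot=%d\n", $1, $1 }' > rankfile
>     for np in 1 2 4; do
>       echo "=== $entries rankfile entries, -np $np ==="
>       mpirun -np $np -rf rankfile ./a.out
>     done
>   done
> 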
> So the segvs really only appear when there is just one entry in the
> rankfile. If I'm the only one able to reproduce them, I'd be happy to
> look into it myself, given some guidance about the actual source code
> location...
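> My best guess at where the symbols from the trace live, assuming the
> v1.4 source layout (please correct me if I'm looking in the wrong
> files):
> 
>   orte/util/nidmap.c                           orte_util_encode_pidmap()
>   orte/mca/odls/base/odls_base_default_fns.c   orte_odls_base_default_get_add_procs_data()
>   orte/mca/plm/base/plm_base_launch_support.c  orte_plm_base_launch_apps()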
> 
> Cheers,
> Bogdan

