On Sat, Feb 27, 2010 at 7:35 PM, Ralph Castain <r...@open-mpi.org> wrote:
> I can't seem to replicate this first problem - it runs fine for me even if 
> the rankfile contains only one entry.

First of all, thanks for taking a look at this!

For me it's repeatable. Please note that I do specify '-np 4' even
when the rankfile contains only one entry; I've just checked that this
also happens with other values given to -np. The only time I don't get
a segv is with '-np 1', in which case I get the 'PAFFINITY cannot get
physical core id...' error message instead. With other combinations,
like 2 entries in the rankfile and '-np 4', the segv doesn't appear
either, only the error message. Anyway, for the original case (one
entry in the rankfile and '-np 4'), the output obtained with the
suggested debug is:

[mbm-01-24:24102] mca:base:select:(rmaps) Querying component [load_balance]
[mbm-01-24:24102] mca:base:select:(rmaps) Skipping component [load_balance]. Query failed to return a module
[mbm-01-24:24102] mca:base:select:(rmaps) Querying component [rank_file]
[mbm-01-24:24102] mca:base:select:(rmaps) Query of component [rank_file] set priority to 100
[mbm-01-24:24102] mca:base:select:(rmaps) Querying component [round_robin]
[mbm-01-24:24102] mca:base:select:(rmaps) Query of component [round_robin] set priority to 70
[mbm-01-24:24102] mca:base:select:(rmaps) Querying component [seq]
[mbm-01-24:24102] mca:base:select:(rmaps) Query of component [seq] set priority to 0
[mbm-01-24:24102] mca:base:select:(rmaps) Selected component [rank_file]
[mbm-01-24:24102] procdir: /tmp/openmpi-sessions-bq_bcostescu@mbm-01-24_0/36756/0/0
[mbm-01-24:24102] jobdir: /tmp/openmpi-sessions-bq_bcostescu@mbm-01-24_0/36756/0
[mbm-01-24:24102] top: openmpi-sessions-bq_bcostescu@mbm-01-24_0
[mbm-01-24:24102] tmp: /tmp
[mbm-01-24:24102] mpirun: reset PATH: /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin:/usr/local/bin:/bin:/usr/bin:/home/bq_bcostescu/bin
[mbm-01-24:24102] mpirun: reset LD_LIBRARY_PATH: /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib
[mbm-01-24:24102] [[36756,0],0] hostfile: checking hostfile hosts for nodes
[mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile hosts
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base: adding node mbm-01-24 to map
[mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] hostfile: filtering nodes through hostfile hosts
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot: created new proc [[36756,1],INVALID]
[mbm-01-24:24102] [[36756,0],0] rmaps:base:claim_slot mapping proc in job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base: mapping proc for job [36756,1] to node mbm-01-24
[mbm-01-24:24102] [[36756,0],0] rmaps:base:compute_usage
[mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons
[mbm-01-24:24102] [[36756,0],0] rmaps:base:define_daemons existing daemon [[36756,0],0] already launched
[mbm-01-24:24102] *** Process received signal ***
[mbm-01-24:24102] Signal: Segmentation fault (11)
[mbm-01-24:24102] Signal code: Address not mapped (1)
[mbm-01-24:24102] Failing at address: 0x70
[mbm-01-24:24102] [ 0] /lib64/libpthread.so.0 [0x2b04e8c727c0]
[mbm-01-24:24102] [ 1] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_util_encode_pidmap+0x140) [0x2b04e7c5b312]
[mbm-01-24:24102] [ 2] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_odls_base_default_get_add_procs_data+0xb31) [0x2b04e7c89557]
[mbm-01-24:24102] [ 3] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0(orte_plm_base_launch_apps+0x1f6) [0x2b04e7ca9210]
[mbm-01-24:24102] [ 4] /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0 [0x2b04e7cb3f2f]
[mbm-01-24:24102] [ 5] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x403d3b]
[mbm-01-24:24102] [ 6] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402ee4]
[mbm-01-24:24102] [ 7] /lib64/libc.so.6(__libc_start_main+0xf4) [0x2b04e8e9c994]
[mbm-01-24:24102] [ 8] /sw/openmpi/1.4.1-debug/gcc/4.4.3/bin/mpirun [0x402e09]
[mbm-01-24:24102] *** End of error message ***
Segmentation fault

After applying r21728 by hand to the original 1.4.1, I can start the
job properly as expected and the 'PAFFINITY cannot get physical core
id...' error message no longer appears, so I'd like to ask for that
change to be applied to the 1.4 series. With this patch in place, I've
tested the following combinations:

entries in rankfile   -np   result
1                      1    OK
1                      2    segv
1                      4    segv
2                      1    OK
2                      2    OK
2                      4    OK
4                      4    OK
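
To make the failing case in the table concrete, the setup is
essentially of the following shape (the executable name is just a
placeholder and the option spelling is from memory, so adjust as
needed):

  $ cat rankfile
  rank 0=mbm-01-24 slot=0
  $ mpirun -np 4 --hostfile hosts --rankfile rankfile ./a.out

i.e. the rankfile describes only rank 0 while -np asks for more ranks
than that.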

So the segvs really only appear when there is a single entry in the
rankfile; if I'm the only one able to reproduce them, I'd be happy to
look into it myself with some guidance about the relevant source code
location...
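
In the meantime, since this is a debug build, I can probably resolve
the faulting frame from the backtrace above to a file and line with
something like:

  $ gdb /sw/openmpi/1.4.1-debug/gcc/4.4.3/lib/libopen-rte.so.0
  (gdb) info line *(orte_util_encode_pidmap+0x140)

which should at least show where inside orte_util_encode_pidmap() it
blows up.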

Cheers,
Bogdan
