Hi all,

There is a problem with Open MPI 1.3.4 when it is compiled with the Intel 
11.1.059 C compiler, related to the built-in processor binding functionality. 
The problem does not occur when Open MPI is compiled with the GNU C compiler.

An MPI program segfaults in MPI_Init() when the following rankfile is used:
 rank 0=node01 slot=0-3
 rank 1=node01 slot=0-3
but runs fine with:
 rank 0=node01 slot=0
 rank 1=node01 slot=1-3
and fine with:
 rank 0=node01 slot=0-1
 rank 1=node01 slot=1-3
but segfaults with:
 rank 0=node01 slot=0-2
 rank 1=node01 slot=1-3
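
To make the four cases above easier to compare, here is a small Python sketch 
that expands each slot list and tabulates the size of each rank's core set and 
the cores shared by both ranks. The helper names (parse_slots, summarize) are 
made up for this illustration, and the parser only handles plain logical slot 
lists such as "0-3", not the socket:core notation. It only restates the 
outcomes reported above and is not a diagnosis.

```python
def parse_slots(spec):
    """Expand a plain slot list such as "0-3" or "0,2-3" into a set of core ids."""
    cores = set()
    for part in spec.split(","):
        if "-" in part:
            lo, hi = part.split("-")
            cores.update(range(int(lo), int(hi) + 1))
        else:
            cores.add(int(part))
    return cores


# (rank 0 slots, rank 1 slots, outcome as reported in this post)
CASES = [
    ("0-3", "0-3", "segfault"),
    ("0", "1-3", "ok"),
    ("0-1", "1-3", "ok"),
    ("0-2", "1-3", "segfault"),
]


def summarize(cases):
    """Return one row per case: slot specs, set sizes, shared cores, outcome."""
    rows = []
    for r0, r1, outcome in cases:
        s0, s1 = parse_slots(r0), parse_slots(r1)
        rows.append((r0, r1, len(s0), len(s1), sorted(s0 & s1), outcome))
    return rows


if __name__ == "__main__":
    for row in summarize(CASES):
        print(row)
```

With only four data points this does not isolate a cause: both "rank 0 is 
given three or more cores" and "the two ranks share two or more cores" are 
consistent with the outcomes listed.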

This is on a two-processor quad-core Opteron machine (it occurs on all nodes of 
the cluster) running Ubuntu 8.10, kernel 2.6.27-16.
This is the simplest case that fails. In general, I would like to bind processes 
to physical processors but allow any core on that processor, like
 rank 0=node01 slot=p0:0-3
 rank 1=node01 slot=p0:0-3
 rank 2=node01 slot=p0:0-3
 rank 3=node01 slot=p0:0-3
 rank 4=node01 slot=p1:0-3
 rank 5=node01 slot=p1:0-3
 rank 6=node01 slot=p1:0-3
 rank 7=node01 slot=p1:0-3
which fails too.
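
For reference, a rankfile of the shape shown above can be generated with a few 
lines of Python. This is only a sketch: make_rankfile is a hypothetical helper, 
and it assumes one rank per core, with every rank on a socket allowed to use 
all cores of that socket.

```python
def make_rankfile(host, sockets, cores_per_socket):
    """Build rankfile text that binds each rank to one physical socket
    while allowing any core on that socket (slot=p<socket>:0-<last core>)."""
    lines = []
    rank = 0
    last_core = cores_per_socket - 1
    for socket in range(sockets):
        # One rank per core on this socket, all sharing the same core range.
        for _ in range(cores_per_socket):
            lines.append(f"rank {rank}={host} slot=p{socket}:0-{last_core}")
            rank += 1
    return "\n".join(lines)


if __name__ == "__main__":
    # Two quad-core sockets on node01 reproduces the eight-line rankfile above.
    print(make_rankfile("node01", 2, 4))
```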

This happens with a test program that contains only two calls, mpi_init 
followed by mpi_finalize, and it happens in both Fortran and C.

Interestingly, the problem with setting the process affinity does not occur on 
our four-processor quad-core Opteron nodes, which run exactly the same OS etc.


Setting "--mca paffinity_base_verbose 5" shows what is going wrong for this 
rankfile:
 rank 0=node01 slot=0-3
 rank 1=node01 slot=0-3
------------- WRONG -----------------
[node01:23174] mca:base:select:(paffinity) Querying component [linux]
[node01:23174] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:23174] mca:base:select:(paffinity) Selected component [linux]
[node01:23174] paffinity slot assignment: slot_list == 0-3
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:23174] paffinity slot assignment: slot_list == 0-3
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
[node01:23175] mca:base:select:(paffinity) Querying component [linux]
[node01:23175] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:23175] mca:base:select:(paffinity) Selected component [linux]
[node01:23176] mca:base:select:(paffinity) Querying component [linux]
[node01:23176] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node01:23176] mca:base:select:(paffinity) Selected component [linux]
[node01:23175] paffinity slot assignment: slot_list == 0-3
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node01:23176] paffinity slot assignment: slot_list == 0-3
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
[node01:23175] *** Process received signal ***
[node01:23176] *** Process received signal ***
[node01:23175] Signal: Segmentation fault (11)
[node01:23175] Signal code: Address not mapped (1)
[node01:23175] Failing at address: 0x30
[node01:23176] Signal: Segmentation fault (11)
[node01:23176] Signal code: Address not mapped (1)
[node01:23176] Failing at address: 0x30
------------- WRONG -----------------

------------- RIGHT -----------------
[node25:23241] mca:base:select:(paffinity) Querying component [linux]
[node25:23241] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node25:23241] mca:base:select:(paffinity) Selected component [linux]
[node25:23241] paffinity slot assignment: slot_list == 0-3
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #1 (#1)
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #2 (#2)
[node25:23241] paffinity slot assignment: rank 0 runs on cpu #3 (#3)
[node25:23241] paffinity slot assignment: slot_list == 0-3
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #1 (#1)
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #2 (#2)
[node25:23241] paffinity slot assignment: rank 1 runs on cpu #3 (#3)
[node25:23242] mca:base:select:(paffinity) Querying component [linux]
[node25:23242] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node25:23242] mca:base:select:(paffinity) Selected component [linux]
[node25:23243] mca:base:select:(paffinity) Querying component [linux]
[node25:23243] mca:base:select:(paffinity) Query of component [linux] set 
priority to 10
[node25:23243] mca:base:select:(paffinity) Selected component [linux]
------------- RIGHT -----------------

Apparently, in the RIGHT case only the master process ([node25:23241]) sets the 
paffinity, but in the WRONG case the compute processes ([node01:23175] and 
[node01:23176], rank 0 and rank 1) also try to set their own paffinity, in 
addition to the master process ([node01:23174]).
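
The comparison above was done by eye. A rough Python sketch like the following 
can group the "paffinity slot assignment" lines by PID, so it is easy to see 
which processes performed a slot assignment. The regular expression is written 
against the log lines quoted in this post only, the function name is made up, 
and SAMPLE is a shortened excerpt of the WRONG log, not the full output.

```python
import re
from collections import defaultdict

# Matches lines such as:
# [node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
LINE = re.compile(
    r"^\[(?P<host>[^:\]]+):(?P<pid>\d+)\] paffinity slot assignment: "
    r"rank (?P<rank>\d+) runs on cpu #(?P<cpu>\d+)"
)


def assignments_by_pid(log_text):
    """Map pid -> {rank: [cpu, ...]} for every slot-assignment line found."""
    result = defaultdict(lambda: defaultdict(list))
    for line in log_text.splitlines():
        m = LINE.match(line.strip())
        if m:
            result[int(m["pid"])][int(m["rank"])].append(int(m["cpu"]))
    return {pid: dict(ranks) for pid, ranks in result.items()}


# Shortened excerpt of the WRONG log (one cpu line per rank and process).
SAMPLE = """\
[node01:23174] paffinity slot assignment: slot_list == 0-3
[node01:23174] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23174] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
[node01:23175] paffinity slot assignment: rank 0 runs on cpu #0 (#0)
[node01:23176] paffinity slot assignment: rank 1 runs on cpu #0 (#0)
"""

if __name__ == "__main__":
    for pid, ranks in sorted(assignments_by_pid(SAMPLE).items()):
        print(pid, ranks)
```

A PID that lists more than one rank is the launcher-side process; a PID that 
lists a single rank is a compute process assigning its own slots.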



Note that the rankfile notation below does not work either, but that seems to 
have a different cause: it tries to bind to core #4, whereas only cores 0-3 
exist.
 rank 0=node01 slot=0:*
 rank 1=node01 slot=0:*


Thanks for your help on this!

--
Daan van Rossum
