1. I tried Open MPI 1.5.1 before turning to hwloc-bind. Yep, Open MPI 1.5.1 
does provide the --bycore and --bind-to-core options, but on my machine they 
seem to bind processes to cores according to the *physical* indexes:

-------------
[user@compute-0-8 ~]$ lstopo --physical
Machine (16GB)
  Socket P#0
    L2 (4096KB)
      L1 (32KB) + Core P#0 + PU P#0
      L1 (32KB) + Core P#1 + PU P#2
    L2 (4096KB)
      L1 (32KB) + Core P#2 + PU P#4
      L1 (32KB) + Core P#3 + PU P#6
  Socket P#1
    L2 (4096KB)
      L1 (32KB) + Core P#0 + PU P#1
      L1 (32KB) + Core P#1 + PU P#3
    L2 (4096KB)
      L1 (32KB) + Core P#2 + PU P#5
      L1 (32KB) + Core P#3 + PU P#7
-------------------

Rank 0 --> PU#0 = socket.0:core.0
Rank 1 --> PU#1 = socket.1:core.0
Rank 2 --> PU#2 = socket.0:core.2
Rank 3 --> PU#3 = socket.1:core.2
Rank 4 --> PU#4 = socket.0:core.1
Rank 5 --> PU#5 = socket.1:core.1
Rank 6 --> PU#6 = socket.0:core.3
Rank 7 --> PU#7 = socket.1:core.3
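
For reference, a minimal sketch of the kind of invocation I mean (./test1 
stands in for the real binary; --report-bindings, if your build has it, makes 
mpirun print the bindings it chose, and the result can also be checked from 
another shell on the node with the hwloc tools):

$ mpirun --host compute-0-8 -np 8 --bycore --bind-to-core \
      --report-bindings ./test1
$ hwloc-ps                       # list the bound processes and where they sit
$ hwloc-bind --get --pid <pid>   # cpuset of one particular process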

What I intend to achieve (and verify) is to bind processes following the 
*logical* indexes, i.e.,

Rank 0 --> PU#0 = socket.0:core.0
Rank 1 --> PU#4 = socket.0:core.1
Rank 2 --> PU#2 = socket.0:core.2
Rank 3 --> PU#6 = socket.0:core.3
Rank 4 --> PU#1 = socket.1:core.0
Rank 5 --> PU#5 = socket.1:core.1
Rank 6 --> PU#3 = socket.1:core.2
Rank 7 --> PU#7 = socket.1:core.3
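
Since hwloc-bind takes logical indexes by default, a single process can be 
pinned this way; a quick sketch (./test1 is again a placeholder, and I am 
quoting the location syntax from the hwloc docs rather than from a run I have 
verified):

$ hwloc-bind socket:0.core:1 -- ./test1    # second logical core of socket 0
$ hwloc-bind socket:1.core:3 -- ./test1    # fourth logical core of socket 1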

The above specific configuration can be achieved with OMPI's -rf option and a 
rank file, but it seems to me that the rank file doesn't work in a multiple 
instruction multiple data (MIMD) environment. That complication is what led me 
to try hwloc-bind.
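
For completeness, the single-program rankfile I have in mind is roughly the 
following (a sketch only; rankfile8np8 is just a name made up here, and 
./test1 stands in for the real binary):

$ cat rankfile8np8
rank 0=compute-0-8 slot=0:0
rank 1=compute-0-8 slot=0:1
rank 2=compute-0-8 slot=0:2
rank 3=compute-0-8 slot=0:3
rank 4=compute-0-8 slot=1:0
rank 5=compute-0-8 slot=1:1
rank 6=compute-0-8 slot=1:2
rank 7=compute-0-8 slot=1:3
$ mpirun -np 8 -rf rankfile8np8 ./test1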

FYI, my testing environment and application impose these requirements for 
optimum performance:

i. Different binaries optimized for heterogeneous machines. This necessitates 
MIMD, which can be done in OMPI with the -app option (providing an application 
context file).
ii. The application is communication-sensitive. Thus, fine-grained process 
mapping onto *machines* and onto *cores* is required to minimize the 
inter-machine and inter-socket communication costs incurred on the network and 
on the system bus. Specifically, processes should be mapped onto successive 
cores of one socket before the next socket is considered, i.e., 
socket.0:core0-3, then socket.1:core0-3. That way, the communication among 
neighboring ranks 0-3 stays confined to socket 0 without going over the system 
bus, and likewise for ranks 4-7 on socket 1. As such, the order of the cores 
should follow the *logical* indexes.

Initially, I tried combining the features of rankfile and appfile, e.g.,

$ cat rankfile8np4
rank 0=compute-0-8 slot=0:0
rank 1=compute-0-8 slot=0:1
rank 2=compute-0-8 slot=0:2
rank 3=compute-0-8 slot=0:3
$ cat rankfile9np4
rank 0=compute-0-9 slot=0:0
rank 1=compute-0-9 slot=0:1
rank 2=compute-0-9 slot=0:2
rank 3=compute-0-9 slot=0:3
$ cat my_appfile_rankfile
--host compute-0-8 -rf rankfile8np4 -np 4 ./test1
--host compute-0-9 -rf rankfile9np4 -np 4 ./test2
$ mpirun -app my_appfile_rankfile

but found that only the rankfile given on the first line took effect; the 
second one was ignored completely. After some Googling and trial and error, I 
decided to try an external binder, and that is how I ended up at hwloc-bind.
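
If the rank numbers in a rankfile are meant to be the global MPI_COMM_WORLD 
ranks (something I have not confirmed), then the two per-host files above both 
claiming ranks 0-3 might be part of the problem. An untested variant would be 
a single rankfile with global ranks 0-7 covering both hosts, passed once on 
the mpirun line rather than once per appfile line, assuming -rf is even 
accepted together with -app (rankfile_global and my_appfile are made-up names; 
my_appfile is just my_appfile_rankfile without the per-line -rf):

$ cat rankfile_global
rank 0=compute-0-8 slot=0:0
rank 1=compute-0-8 slot=0:1
rank 2=compute-0-8 slot=0:2
rank 3=compute-0-8 slot=0:3
rank 4=compute-0-9 slot=0:0
rank 5=compute-0-9 slot=0:1
rank 6=compute-0-9 slot=0:2
rank 7=compute-0-9 slot=0:3
$ cat my_appfile
--host compute-0-8 -np 4 ./test1
--host compute-0-9 -np 4 ./test2
$ mpirun -rf rankfile_global -app my_appfile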

Maybe I should bring the issue of rankfile + appfile to the OMPI mailing list.


2. I thought of invoking a script too, but am not sure how to start. Thanks 
for your info. I shall come back to you if I need further help.
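
To make the script idea a bit more concrete, the rough shape I picture is a 
small wrapper that computes the target socket and core from the local rank and 
then execs the real binary under hwloc-bind. This is an untested sketch: 
bind_wrapper.sh and my_appfile_wrapped are made-up names, 
OMPI_COMM_WORLD_LOCAL_RANK would need to be confirmed for 1.5.1, and the 
arithmetic assumes 4 cores per socket as on the node above.

$ cat bind_wrapper.sh
#!/bin/sh
# Untested sketch: bind the calling rank to one logical core,
# filling socket 0 before socket 1 (assumes 4 cores per socket).
lrank=${OMPI_COMM_WORLD_LOCAL_RANK:?must be launched by Open MPI}
socket=$((lrank / 4))
core=$((lrank % 4))
exec hwloc-bind socket:${socket}.core:${core} -- "$@"

$ cat my_appfile_wrapped
--host compute-0-8 -np 4 ./bind_wrapper.sh ./test1
--host compute-0-9 -np 4 ./bind_wrapper.sh ./test2
$ mpirun -app my_appfile_wrapped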


Chan

--- On Mon, 2/14/11, Jeff Squyres <jsquy...@cisco.com> wrote:

From: Jeff Squyres <jsquy...@cisco.com>
Subject: Re: [hwloc-users] hwloc-ps output - how to verify process binding on 
the core level?
To: "Hardware locality user list" <hwloc-us...@open-mpi.org>
Date: Monday, February 14, 2011, 7:26 AM



