Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Jeff Squyres
On Feb 14, 2011, at 8:15 PM, Siew Yin Chan wrote:

> Thank you very much for your input, which makes my direction pretty clear 
> now. Depending on the progress of my project, I may be adventurous enough to 
> try the nightly tarball, or may wait until a stable version is released.

FWIW, we released 1.5.2rc1 today.  It contains the hwloc stuff.

> I appreciate the hard work of the OMPI team, and look forward to a more 
> flexible binding option in a future OMPI release.

Thanks!  We're shooting for 1.5.3, but it might slip to 1.5.4.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Siew Yin Chan
Jeff Squyres,

Thank you very much for your input, which makes my direction pretty clear now. 
Depending on the progress of my project, I may be adventurous enough to try the 
nightly tarball, or may wait until a stable version is released.

I appreciate the hard work of the OMPI team, and look forward to a more 
flexible binding option in a future OMPI release.


Chan


--- On Mon, 2/14/11, Jeff Squyres  wrote:

> From: Jeff Squyres 
> Subject: Re: [hwloc-users] hwloc-ps output - how to verify process binding on 
> the core level?
> To: "Hardware locality user list" 
> Date: Monday, February 14, 2011, 8:53 AM
> On Feb 14, 2011, at 9:35 AM, Siew Yin Chan wrote:
> 
> > 1. I tried Open MPI 1.5.1 before turning to hwloc-bind. Yep. Open MPI 1.5.1 
> > does provide the --bycore and --bind-to-core options, but these options seem 
> > to bind processes to cores on my machine according to the *physical* indexes:
> 
> FWIW, you might want to try one of the OMPI 1.5.2 nightly tarballs -- we 
> switched the process affinity stuff to hwloc in 1.5.2 (the 1.5.1 stuff uses a 
> different mechanism).
> 
> > FYI, my testing environment and application impose these requirements for 
> > optimum performance:
> > 
> > i. Different binaries optimized for heterogeneous machines. This necessitates 
> > MIMD, and can be done in OMPI using the -app option (providing an application 
> > context file).
> > ii. The application is communication-sensitive. Thus, fine-grained process 
> > mapping on *machines* and on *cores* is required to minimize inter-machine 
> > and inter-socket communication costs occurring on the network and on the 
> > system bus. Specifically, processes should be mapped onto successive cores of 
> > one socket before the next socket is considered, i.e., socket.0:core0-3, then 
> > socket.1:core0-3. In this case, the communication among neighboring rank 0-3 
> > will be confined to socket 0 without going through the system bus. Same for 
> > rank 4-7 on socket 1. As such, the order of the cores should follow the 
> > *logical* indexes.
> 
> I think that OMPI 1.5.2 should do this for you -- rather than following any 
> logical/physical ordering, it does what you describe: traverses successive 
> cores on a socket before going to the next socket (which happens to correspond 
> to hwloc's logical ordering, but that was not the intent).
> 
> FWIW, we have a huge revamp of OMPI's affinity support on the mpirun command 
> line that will offer much more flexible binding choices.
> 
> > Initially, I tried combining the features of rankfile and appfile, e.g.,
> > 
> > $ cat rankfile8np4
> > rank 0=compute-0-8 slot=0:0
> > rank 1=compute-0-8 slot=0:1
> > rank 2=compute-0-8 slot=0:2
> > rank 3=compute-0-8 slot=0:3
> > $ cat rankfile9np4
> > rank 0=compute-0-9 slot=0:0
> > rank 1=compute-0-9 slot=0:1
> > rank 2=compute-0-9 slot=0:2
> > rank 3=compute-0-9 slot=0:3
> > $ cat my_appfile_rankfile
> > --host compute-0-8 -rf rankfile8np4 -np 4 ./test1
> > --host compute-0-9 -rf rankfile9np4 -np 4 ./test2
> > $ mpirun -app my_appfile_rankfile
> > 
> > but found out that only the rankfile stated on the first line took effect; 
> > the second was ignored completely. After some time of googling and trial and 
> > error, I decided to try an external binder, and this direction led me to 
> > hwloc-bind.
> > 
> > Maybe I should bring the issue of rankfile + appfile to the OMPI mailing list.
> 
> Yes.
> 
> I'd have to look at it more closely, but it's possible that we only allow one 
> rankfile per job -- i.e., that the rankfile should specify all the procs in 
> the job, not on a per-host basis.  But perhaps we don't warn/error if multiple 
> rankfiles are used; I would consider that a bug.
> 
> -- 
> Jeff Squyres
> jsquy...@cisco.com
> For corporate legal information go to:
> http://www.cisco.com/web/about/doing_business/legal/cri/
> 
> 
> ___
> hwloc-users mailing list
> hwloc-us...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
> 







Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Jeff Squyres
On Feb 14, 2011, at 9:35 AM, Siew Yin Chan wrote:

> 1. I tried Open MPI 1.5.1 before turning to hwloc-bind. Yep. Open MPI 1.5.1 
> does provide the --bycore and --bind-to-core options, but these options seem 
> to bind processes to cores on my machine according to the *physical* indexes:

FWIW, you might want to try one of the OMPI 1.5.2 nightly tarballs -- we 
switched the process affinity stuff to hwloc in 1.5.2 (the 1.5.1 stuff uses a 
different mechanism).

> FYI, my testing environment and application impose these requirements for 
> optimum performance:
> 
> i. Different binaries optimized for heterogeneous machines. This necessitates 
>  MIMD, and can be done in OMPI using the -app option (providing an 
> application context file).
> ii. The application is communication-sensitive. Thus, fine-grained process 
> mapping on *machines* and on *cores* is required to minimize inter-machine 
> and inter-socket communication costs occurring on the network and on the 
> system bus. Specifically, processes should be mapped onto successive cores of 
> one socket before the next socket is considered, i.e., socket.0:core0-3, then 
> socket.1:core0-3. In this case, the communication among neighboring rank 0-3 
> will be confined to socket 0 without going through the system bus. Same for 
> rank 4-7 on socket 1. As such, the order of the cores should follow the 
> *logical* indexes.

I think that OMPI 1.5.2 should do this for you -- rather than following any 
logical/physical ordering, it does what you describe: traverses successive 
cores on a socket before going to the next socket (which happens to correspond 
to hwloc's logical ordering, but that was not the intent).
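
(A hedged illustration, using only the option names already mentioned in this 
thread and not verified against the 1.5.2 nightlies -- the intended invocation 
would look something like:)

$ mpirun -np 8 --bycore --bind-to-core ./test

On the 2-socket, 4-core-per-socket machine discussed in this thread, that 
should put ranks 0-3 on socket 0's cores and ranks 4-7 on socket 1's.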

FWIW, we have a huge revamp of OMPI's affinity support on the mpirun command 
line that will offer much more flexible binding choices.

> Initially, I tried combining the features of rankfile and appfile, e.g.,
> 
> $ cat rankfile8np4
> rank 0=compute-0-8 slot=0:0
> rank 1=compute-0-8 slot=0:1
> rank 2=compute-0-8 slot=0:2
> rank 3=compute-0-8 slot=0:3
> $ cat rankfile9np4
> rank 0=compute-0-9 slot=0:0
> rank 1=compute-0-9 slot=0:1
> rank 2=compute-0-9 slot=0:2
> rank 3=compute-0-9 slot=0:3
> $ cat my_appfile_rankfile
> --host compute-0-8 -rf rankfile8np4 -np 4 ./test1
> --host compute-0-9 -rf rankfile9np4 -np 4 ./test2
> $ mpirun -app my_appfile_rankfile
> 
> but found out that only the rankfile stated on the first line took effect; 
> the second was ignored completely. After some time of googling and trial and 
> error, I decided to try an external binder, and this direction led me to 
> hwloc-bind.
> 
> Maybe I should bring the issue of rankfile + appfile to the OMPI mailing list.

Yes.  

I'd have to look at it more closely, but it's possible that we only allow one 
rankfile per job -- i.e., that the rankfile should specify all the procs in the 
job, not on a per-host basis.  But perhaps we don't warn/error if multiple 
rankfiles are used; I would consider that a bug.
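
(If a single whole-job rankfile is indeed what mpirun expects, a sketch of the 
workaround might look like the following; the file names and the combination 
of -rf with -app are assumptions and untested:)

$ cat my_rankfile
rank 0=compute-0-8 slot=0:0
rank 1=compute-0-8 slot=0:1
rank 2=compute-0-8 slot=0:2
rank 3=compute-0-8 slot=0:3
rank 4=compute-0-9 slot=0:0
rank 5=compute-0-9 slot=0:1
rank 6=compute-0-9 slot=0:2
rank 7=compute-0-9 slot=0:3
$ cat my_appfile
--host compute-0-8 -np 4 ./test1
--host compute-0-9 -np 4 ./test2
$ mpirun -rf my_rankfile -app my_appfile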

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Siew Yin Chan

1. I tried Open MPI 1.5.1 before turning to hwloc-bind. Yep. Open MPI 1.5.1 
does provide the --bycore and --bind-to-core options, but these options seem to 
bind processes to cores on my machine according to the *physical* indexes:

-
[user@compute-0-8 ~]$ lstopo --physical
Machine (16GB)
  Socket P#0
L2 (4096KB)
  L1 (32KB) + Core P#0 + PU P#0
  L1 (32KB) + Core P#1 + PU P#2
L2 (4096KB)
  L1 (32KB) + Core P#2 + PU P#4
  L1 (32KB) + Core P#3 + PU P#6
  Socket P#1
L2 (4096KB)
  L1 (32KB) + Core P#0 + PU P#1
  L1 (32KB) + Core P#1 + PU P#3
L2 (4096KB)
  L1 (32KB) + Core P#2 + PU P#5
  L1 (32KB) + Core P#3 + PU P#7
---

Rank 0 --> PU#0 = socket.0:core.0
Rank 1 --> PU#1 = socket.1:core.0
Rank 2 --> PU#2 = socket.0:core.2
Rank 3 --> PU#3 = socket.1:core.2
Rank 4 --> PU#4 = socket.0:core.1
Rank 5 --> PU#5 = socket.1:core.1
Rank 6 --> PU#6 = socket.0:core.3
Rank 7 --> PU#7 = socket.1:core.3

What I intend to achieve (and verify) is to bind processes following the 
*logical* indexes, i.e.,

Rank 0 --> PU#0 = socket.0:core.0
Rank 1 --> PU#4 = socket.0:core.1
Rank 2 --> PU#2 = socket.0:core.2
Rank 3 --> PU#6 = socket.0:core.3
Rank 4 --> PU#1 = socket.1:core.0
Rank 5 --> PU#5 = socket.1:core.1
Rank 6 --> PU#3 = socket.1:core.2
Rank 7 --> PU#7 = socket.1:core.3

The above specific configuration can be achieved using the -rf option with a 
rank file in OMPI, but it seems to me that the rank file doesn't work in a 
multiple instruction, multiple data (MIMD) environment. This complication led 
me to try hwloc-bind.
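
(As an untested aside on verification: hwloc-calc accepts the same location 
syntax as hwloc-bind and prints the corresponding cpuset mask, which can then 
be compared against what hwloc-ps -c reports. On the topology above, socket 
0's four cores hold PUs P#0, P#2, P#4 and P#6, so the expected output would be 
something like:)

$ hwloc-calc socket:0.core:0-3
0x00000055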

FYI, my testing environment and application impose these requirements for 
optimum performance:

i. Different binaries optimized for heterogeneous machines. This necessitates  
MIMD, and can be done in OMPI using the -app option (providing an application 
context file).
ii. The application is communication-sensitive. Thus, fine-grained process 
mapping on *machines* and on *cores* is required to minimize inter-machine and 
inter-socket communication costs occurring on the network and on the system 
bus. Specifically, processes should be mapped onto successive cores of one 
socket before the next socket is considered, i.e., socket.0:core0-3, then 
socket.1:core0-3. In this case, the communication among neighboring rank 0-3 
will be confined to socket 0 without going through the system bus. Same for 
rank 4-7 on socket 1. As such, the order of the cores should follow the 
*logical* indexes.

Initially, I tried combining the features of rankfile and appfile, e.g.,

$ cat rankfile8np4
rank 0=compute-0-8 slot=0:0
rank 1=compute-0-8 slot=0:1
rank 2=compute-0-8 slot=0:2
rank 3=compute-0-8 slot=0:3
$ cat rankfile9np4
rank 0=compute-0-9 slot=0:0
rank 1=compute-0-9 slot=0:1
rank 2=compute-0-9 slot=0:2
rank 3=compute-0-9 slot=0:3
$ cat my_appfile_rankfile
--host compute-0-8 -rf rankfile8np4 -np 4 ./test1
--host compute-0-9 -rf rankfile9np4 -np 4 ./test2
$ mpirun -app my_appfile_rankfile

but found out that only the rankfile stated on the first line took effect; the 
second was ignored completely. After some time of googling and trial and error, 
I decided to try an external binder, and this direction led me to hwloc-bind.

Maybe I should bring the issue of rankfile + appfile to the OMPI mailing list.


2. I thought of invoking a script too, but am not sure how to start. Thanks for 
your info. I shall come back to you if I need further help.


Chan

--- On Mon, 2/14/11, Jeff Squyres  wrote:

From: Jeff Squyres 
Subject: Re: [hwloc-users] hwloc-ps output - how to verify process binding on 
the core level?
To: "Hardware locality user list" 
Date: Monday, February 14, 2011, 7:26 AM






Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Jeff Squyres
On Feb 13, 2011, at 4:07 AM, Brice Goglin wrote:

>> $ mpirun -np 4 hwloc-bind socket:0.core:0-3 ./test
>> 
>> 1. Does hwloc-bind map the processes *sequentially* on *successive* cores of 
>> the socket?
> 
> No. Each hwloc-bind command in the mpirun above doesn't know that there are 
> other hwloc-bind instances on the same machine. All of them bind their 
> process to all cores in the first socket.

To further underscore this point, mpirun launched 4 copies of:

hwloc-bind socket:0.core:0-3 ./test

Which means that all 4 processes bound to exactly the same thing.
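
(As an untested sketch of how to see that: hwloc-bind --get prints the binding 
of the current process, so each of those copies, run this way, should report 
the same mask -- on the machine discussed in this thread, socket 0's PUs:)

$ hwloc-bind socket:0.core:0-3 -- hwloc-bind --get
0x00000055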

If you want each process to bind to a *different* set of PUs, then you have 
two choices:

1. See Open MPI 1.5.1's mpirun(1) man page.  There are new affinity options in 
the OMPI 1.5 series, such as --bind-to-core and --bind-to-socket.  We wrote 
them up in the FAQ, too.

2. Write a wrapper script that looks at the Open MPI environment variables 
OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_LOCAL_RANK, or 
OMPI_COMM_WORLD_NODE_RANK and decides how to invoke hwloc-bind.  For example, 
something like this:

mpirun -np 4 my_wrapper.sh ./test

where my_wrapper.sh is:

-
#!/bin/sh

if test "$OMPI_COMM_WORLD_RANK" = "0"; then
bind_string=...whatever...
else
bind_string=...whatever...
fi
hwloc-bind $bind_string "$@"
-

Something like that.
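
(To make that concrete, here is an untested sketch that assumes the goal from 
earlier in the thread -- fill socket 0's cores before socket 1's -- and relies 
on hwloc's logical core numbering walking socket 0 first on this machine:)

-
#!/bin/sh
# my_wrapper.sh: bind each process to one core, chosen by its rank on this node.
# OMPI_COMM_WORLD_LOCAL_RANK is set by Open MPI for every launched process.
lrank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
# core:N uses hwloc's logical numbering, which covers the whole machine.
exec hwloc-bind core:$lrank -- "$@"
-

It would be invoked exactly as in the mpirun line above.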

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Samuel Thibault
Brice Goglin, on Mon 14 Feb 2011 07:56:56 +0100, wrote:
> The operating system decides where each process runs (according to the
> binding). It usually has no knowledge of MPI ranks. And I don't think it looks
> at the PID numbers during the scheduling.

It doesn't either, indeed.

Samuel


Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Brice Goglin
On 14/02/2011 07:43, Siew Yin Chan wrote:
>
> No. Each hwloc-bind command in the mpirun above doesn't know that
> there are other hwloc-bind instances on the same machine. All of
> them bind their process to all cores in the first socket.
>
> => Agree. For socket:0.core:0-3, hwloc will bind the MPI processes to
> all cores in the first socket. But how are the individual processes
> mapped on these cores? Will it be in this order:
>
> rank 0 -> core 0
> rank 1 -> core 1
> rank 2 -> core 2
> rank 3 -> core 3
>
> Or in this *arbitrary* order:
>
> rank 0 -> core 1
> rank 1 -> core 3
> rank 2 -> core 0
> rank 3 -> core 2
>

The operating system decides where each process runs (according to the
binding). It usually has no knowledge of MPI ranks. And I don't think it
looks at the PID numbers during the scheduling. So it's very likely random.


Please distinguish your replies from the text you quote. Otherwise, it's
hard to tell where your reply is. I hope I didn't miss anything.

Brice




Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Siew Yin Chan
--- On Sun, 2/13/11, Brice Goglin  wrote:

From: Brice Goglin 
Subject: Re: [hwloc-users] hwloc-ps output - how to verify process binding on 
the core level?
To: "Hardware locality user list" 
Date: Sunday, February 13, 2011, 3:07 AM

> On 13/02/2011 04:54, Siew Yin Chan wrote:
> > Good day,
> > 
> > I'm studying the impact of MPI process binding on communication costs in my 
> > project, and would like to use hwloc-bind to achieve fine-grained mapping 
> > control. I installed hwloc 1.1.1 on a 2-socket 4-core machine (with 2 
> > dual-core dies in each socket), and ran hwloc-ps to verify the binding:
> > 
> > $ mpirun -V
> > mpirun (Open MPI) 1.5.1
> > 
> > $ mpirun -np 4 hwloc-bind socket:0.core:0-3 ./test
> > 
> > hwloc-ps shows the following output:
> > 
> > $ hwloc-ps -p
> > 1497    Socket:0                ./test
> > 1498    Socket:0                ./test
> > 1499    Socket:0                ./test
> > 1500    Socket:0                ./test
> > $ hwloc-ps -l
> > 1497    Socket:0                ./test
> > 1498    Socket:0                ./test
> > 1499    Socket:0                ./test
> > 1500    Socket:0                ./test
> > $ hwloc-ps -c
> > 1497    0x0055              ./test
> > 1498    0x0055              ./test
> > 1499    0x0055              ./test
> > 1500    0x0055              ./test
> > 
> > Questions: 
> > 1. Does hwloc-bind map the processes *sequentially* on *successive* cores of 
> > the socket?

> Hello,
> 
> No. Each hwloc-bind command in the mpirun above doesn't know that there 
> are other hwloc-bind instances on the same machine. All of them bind 
> their process to all cores in the first socket.

=> Agree. For socket:0.core:0-3, hwloc will bind the MPI processes to all 
cores in the first socket. But how are the individual processes mapped on these 
cores? Will it be in this order:

rank 0 -> core 0
rank 1 -> core 1
rank 2 -> core 2
rank 3 -> core 3

Or in this *arbitrary* order:

rank 0 -> core 1
rank 1 -> core 3
rank 2 -> core 0
rank 3 -> core 2

> > 2. How could hwloc-ps help verify this binding, i.e.,
> > 
> > process 1497 (rank 0) on socket.0:core.0
> > process 1498 (rank 1) on socket.0:core.1
> > process 1499 (rank 2) on socket.0:core.2
> > process 1500 (rank 3) on socket.0:core.3




(let's assume your mpirun command did what you want)



You would get something like this from hwloc-ps:



1497    Core:0    ./test



1498    Core:1    ./test



1499    Core:2    ./test



1500    Core:0    ./test







These core numbers are the logical core number among the entire
machine. hwloc-ps can't easily show hierarchical location such as
socket.core since there are many possible combinations, especially
because of caches.



Actually, you might get L1Cache instead of Core above since hwloc-ps
reports the first object that exactly matches the process binding (and
L1 is above but equal to Core in your machine).



If you want to get other output, I suggest you use hwloc-calc to
convert the hwloc-ps output.




  

  

> > Equivalently, does the binding of `socket:0.core:0-1 socket:1.core:0-1' with 
> > hwloc-ps showing
> > 
> > $ hwloc-ps -l
> > 1315    L2Cache:0 L2Cache:2             ./test
> > 1316    L2Cache:0 L2Cache:2             ./test
> > 1317    L2Cache:0 L2Cache:2             ./test
> > 1318    L2Cache:0 L2Cache:2             ./test
> > 
> > indicate the following? I.e.,
> > 
> > process 1315 (rank 0) on socket.0:core.0
> > process 1316 (rank 1) on socket.0:core.1
> > process 1317 (rank 2) on socket.1:core.0
> > process 1318 (rank 3) on socket.1:core.1

> No. Again, all processes are bound to 4 different cores, so hwloc-ps 
> shows the largest objects containing those cores.
> 
> In the end, you want an MPI launcher that takes care of the binding 
> instead of having to manually bind on the command line. It should be 
> the case with most MPI launchers nowadays. Once this is OK, hwloc-ps 
> will report the exact core where you bound. And you might need to play 
> with hwloc-calc to rewrite the hwloc-ps output as you want.
> 
> I am thinking of adding an option to hwloc-calc to help rewriting a 
> random string into socket:X.core:Y or something like that.
> 
> Brice



 

