Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Jeff Squyres
On Feb 14, 2011, at 9:35 AM, Siew Yin Chan wrote:

> 1. I tried Open MPI 1.5.1 before turning to hwloc-bind. Yep. Open MPI 1.5.1 
> does provide the --bycore and --bind-to-core options, but these seem to bind 
> processes to cores on my machine according to the *physical* indexes:

FWIW, you might want to try one of the OMPI 1.5.2 nightly tarballs -- we 
switched the process affinity stuff to hwloc in 1.5.2 (the 1.5.1 stuff uses a 
different mechanism).

> FYI, my testing environment and application impose these requirements for 
> optimum performance:
> 
> i. Different binaries optimized for heterogeneous machines. This necessitates 
>  MIMD, and can be done in OMPI using the -app option (providing an 
> application context file).
> ii. The application is communication-sensitive. Thus, fine-grained process 
> mapping on *machines* and on *cores* is required to minimize inter-machine 
> and inter-socket communication costs occurring on the network and on the 
> system bus. Specifically, processes should be mapped onto successive cores of 
> one socket before the next socket is considered, i.e., socket.0:core0-3, then 
> socket.1:core0-3. In this case, the communication among neighboring rank 0-3 
> will be confined to socket 0 without going through the system bus. Same for 
> rank 4-7 on socket 1. As such, the order of the cores should follow the 
> *logical* indexes.

I think that OMPI 1.5.2 should do this for you -- rather than following any 
logical/physical ordering, it does what you describe: traverses successive 
cores on a socket before going to the next socket (which happens to correspond 
to hwloc's logical ordering, but that was not the intent).
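One way to verify the result from the node itself is hwloc's own tools; a 
minimal sketch, assuming your hwloc version supports the logical/physical 
switches of hwloc-ps:

  $ hwloc-ps -l    # bound processes, with hwloc's logical indexes
  $ hwloc-ps -p    # same processes, with physical (OS) indexes

With the mapping you want, the logical view should show ranks 0-3 on 
consecutive cores of socket 0 and ranks 4-7 on socket 1.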

FWIW, we have a huge revamp of OMPI's affinity support on the mpirun command 
line that will offer much more flexible binding choices.

> Initially, I tried combining the features of rankfile and appfile, e.g.,
> 
> $ cat rankfile8np4
> rank 0=compute-0-8 slot=0:0
> rank 1=compute-0-8 slot=0:1
> rank 2=compute-0-8 slot=0:2
> rank 3=compute-0-8 slot=0:3
> $ cat rankfile9np4
> rank 0=compute-0-9 slot=0:0
> rank 1=compute-0-9 slot=0:1
> rank 2=compute-0-9 slot=0:2
> rank 3=compute-0-9 slot=0:3
> $ cat my_appfile_rankfile
> --host compute-0-8 -rf rankfile8np4 -np 4 ./test1
> --host compute-0-9 -rf rankfile9np4 -np 4 ./test2
> $ mpirun -app my_appfile_rankfile
> 
> but found that only the rankfile on the first line took effect; the second 
> was ignored completely. After some googling and trial and error, I decided to 
> try an external binder, which led me to hwloc-bind.
> 
> Maybe I should bring the issue of rankfile + appfile to the OMPI mailing list.

Yes.  

I'd have to look at it more closely, but it's possible that we only allow one 
rankfile per job -- i.e., that the rankfile should specify all the procs in the 
job, not on a per-host basis.  But perhaps we don't warn/error if multiple 
rankfiles are used; I would consider that a bug.
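If it is indeed one rankfile per job, a single rankfile naming every rank on 
both hosts might get you what you want -- a sketch only, I haven't verified 
this exact rankfile+appfile combination:

  $ cat rankfile_all
  rank 0=compute-0-8 slot=0:0
  rank 1=compute-0-8 slot=0:1
  rank 2=compute-0-8 slot=0:2
  rank 3=compute-0-8 slot=0:3
  rank 4=compute-0-9 slot=0:0
  rank 5=compute-0-9 slot=0:1
  rank 6=compute-0-9 slot=0:2
  rank 7=compute-0-9 slot=0:3
  $ cat my_appfile
  --host compute-0-8 -np 4 ./test1
  --host compute-0-9 -np 4 ./test2
  $ mpirun -rf rankfile_all -app my_appfile

Note that the ranks are numbered globally (0-7) across both appfile lines, 
since the MIMD processes share a single MPI_COMM_WORLD.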

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to:
http://www.cisco.com/web/about/doing_business/legal/cri/




Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Samuel Thibault
Brice Goglin wrote, on Mon 14 Feb 2011 07:56:56 +0100:
> The operating system decides where each process runs (according to the
> binding). It usually has no knowledge of MPI ranks. And I don't think it looks
> at the PID numbers during the scheduling.

Indeed, it doesn't.

Samuel


Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?

2011-02-14 Thread Brice Goglin
On 14/02/2011 07:43, Siew Yin Chan wrote:
> No. Each hwloc-bind command in the mpirun above doesn't know that
> there are other hwloc-bind instances on the same machine. All of
> them bind their process to all cores in the first socket.
>
> => Agree. For socket:0.core:0-3, hwloc will bind the MPI processes to
> all cores in the first socket. But how are the individual processes
> mapped on these cores? Will it be in this order:
>
> rank 0 -> core 0
> rank 1 -> core 1
> rank 2 -> core 2
> rank 3 -> core 3
>
> Or in this *arbitrary* order:
>
> rank 0 -> core 1
> rank 1 -> core 3
> rank 2 -> core 0
> rank 3 -> core 2
>

The operating system decides where each process runs (according to the
binding). It usually has no knowledge of MPI ranks. And I don't think it
looks at the PID numbers during the scheduling. So it's very likely random.
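If you want to see where a given rank actually ended up, you can check on the 
node; a rough sketch (tool options to verify on your system):

  $ hwloc-ps -c                                  # bound processes, with their cpusets
  $ ps -o pid,psr,comm -p <pid>                  # PSR = CPU the process last ran on
  $ grep Cpus_allowed_list /proc/<pid>/status    # the kernel's binding mask for that PID

Keep in mind that the cpuset/mask is the *constraint*; within it, the scheduler
may still move the process between cores at any time.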


Please distinguish your replies from the text you quote. Otherwise, it's
hard to tell where your reply is. I hope I didn't miss anything.

Brice