Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
On Feb 14, 2011, at 8:15 PM, Siew Yin Chan wrote:

> Thank you very much for your input, which makes my direction pretty clear now.
> Depending on the progress of my project, I may be adventurous and try the
> nightly tarball, or may wait until a stable version is released.

FWIW, we released 1.5.2rc1 today. It contains the hwloc stuff.

> I appreciate the hard work of the OMPI team, and look forward to a more
> flexible binding option in a future OMPI release.

Thanks! We're shooting for 1.5.3, but it might slip to 1.5.4.

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: http://www.cisco.com/web/about/doing_business/legal/cri/
Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
Jeff Squyres,

Thank you very much for your input, which makes my direction pretty clear now. Depending on the progress of my project, I may be adventurous and try the nightly tarball, or may wait until a stable version is released.

I appreciate the hard work of the OMPI team, and look forward to a more flexible binding option in a future OMPI release.

Chan

--- On Mon, 2/14/11, Jeff Squyres wrote:

> From: Jeff Squyres
> Subject: Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
> To: "Hardware locality user list"
> Date: Monday, February 14, 2011, 8:53 AM
Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
On Feb 14, 2011, at 9:35 AM, Siew Yin Chan wrote:

> 1. I tried Open MPI 1.5.1 before turning to hwloc-bind. Yep, Open MPI 1.5.1
> does provide the --bycore and --bind-to-core options, but they seem to bind
> processes to cores on my machine according to the *physical* indexes.

FWIW, you might want to try one of the OMPI 1.5.2 nightly tarballs -- we switched the process affinity stuff to hwloc in 1.5.2 (the 1.5.1 stuff uses a different mechanism).

> FYI, my testing environment and application impose these requirements for
> optimum performance:
>
> i. Different binaries optimized for heterogeneous machines. This necessitates
> MIMD, and can be done in OMPI using the -app option (providing an application
> context file).
>
> ii. The application is communication-sensitive. Thus, fine-grained process
> mapping on *machines* and on *cores* is required to minimize the inter-machine
> and inter-socket communication costs incurred on the network and on the system
> bus. Specifically, processes should be mapped onto successive cores of one
> socket before the next socket is considered, i.e., socket.0:core0-3, then
> socket.1:core0-3. In this case, communication among neighboring ranks 0-3 will
> be confined to socket 0 without going through the system bus; same for ranks
> 4-7 on socket 1. As such, the order of the cores should follow the *logical*
> indexes.

I think that OMPI 1.5.2 should do this for you -- rather than following any logical/physical ordering, it does what you describe: it traverses successive cores on a socket before going to the next socket (which happens to correspond to hwloc's logical ordering, but that was not the intent).

FWIW, we have a huge revamp of OMPI's affinity support on the mpirun command line coming that will offer much more flexible binding choices.
> Initially, I tried combining the features of rankfile and appfile, e.g.,
>
> $ cat rankfile8np4
> rank 0=compute-0-8 slot=0:0
> rank 1=compute-0-8 slot=0:1
> rank 2=compute-0-8 slot=0:2
> rank 3=compute-0-8 slot=0:3
> $ cat rankfile9np4
> rank 0=compute-0-9 slot=0:0
> rank 1=compute-0-9 slot=0:1
> rank 2=compute-0-9 slot=0:2
> rank 3=compute-0-9 slot=0:3
> $ cat my_appfile_rankfile
> --host compute-0-8 -rf rankfile8np4 -np 4 ./test1
> --host compute-0-9 -rf rankfile9np4 -np 4 ./test2
> $ mpirun -app my_appfile_rankfile
>
> but found out that only the rankfile stated on the first line took effect;
> the second was ignored completely. After some googling and trial and error,
> I decided to try an external binder, which led me to hwloc-bind.
>
> Maybe I should bring the issue of rankfile + appfile to the OMPI mailing list.

Yes.

I'd have to look at it more closely, but it's possible that we only allow one rankfile per job -- i.e., that the rankfile should specify all the procs in the job, not on a per-host basis. But perhaps we don't warn/error if multiple rankfiles are used; I would consider that a bug.

-- 
Jeff Squyres
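If only one rankfile per job is honored, a single rankfile spanning both hosts may be a workaround worth testing: with an appfile, ranks are numbered globally across the application contexts, so ./test1 would presumably get ranks 0-3 and ./test2 ranks 4-7. A hypothetical, untested sketch (rankfile_all is an invented name), passed once via -rf alongside the appfile:

```
$ cat rankfile_all
rank 0=compute-0-8 slot=0:0
rank 1=compute-0-8 slot=0:1
rank 2=compute-0-8 slot=0:2
rank 3=compute-0-8 slot=0:3
rank 4=compute-0-9 slot=0:0
rank 5=compute-0-9 slot=0:1
rank 6=compute-0-9 slot=0:2
rank 7=compute-0-9 slot=0:3
```

Whether rank numbering interacts with the appfile this way would need to be verified against the Open MPI documentation.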
Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
1. I tried Open MPI 1.5.1 before turning to hwloc-bind. Yep, Open MPI 1.5.1 does provide the --bycore and --bind-to-core options, but they seem to bind processes to cores on my machine according to the *physical* indexes:

----------
[user@compute-0-8 ~]$ lstopo --physical
Machine (16GB)
  Socket P#0
    L2 (4096KB)
      L1 (32KB) + Core P#0 + PU P#0
      L1 (32KB) + Core P#1 + PU P#2
    L2 (4096KB)
      L1 (32KB) + Core P#2 + PU P#4
      L1 (32KB) + Core P#3 + PU P#6
  Socket P#1
    L2 (4096KB)
      L1 (32KB) + Core P#0 + PU P#1
      L1 (32KB) + Core P#1 + PU P#3
    L2 (4096KB)
      L1 (32KB) + Core P#2 + PU P#5
      L1 (32KB) + Core P#3 + PU P#7
----------

Rank 0 --> PU#0 = socket.0:core.0
Rank 1 --> PU#1 = socket.1:core.0
Rank 2 --> PU#2 = socket.0:core.2
Rank 3 --> PU#3 = socket.1:core.2
Rank 4 --> PU#4 = socket.0:core.1
Rank 5 --> PU#5 = socket.1:core.1
Rank 6 --> PU#6 = socket.0:core.3
Rank 7 --> PU#7 = socket.1:core.3

What I intend to achieve (and verify) is to bind processes following the *logical* indexes, i.e.,

Rank 0 --> PU#0 = socket.0:core.0
Rank 1 --> PU#4 = socket.0:core.1
Rank 2 --> PU#2 = socket.0:core.2
Rank 3 --> PU#6 = socket.0:core.3
Rank 4 --> PU#1 = socket.1:core.0
Rank 5 --> PU#5 = socket.1:core.1
Rank 6 --> PU#3 = socket.1:core.2
Rank 7 --> PU#7 = socket.1:core.3

This specific configuration can be achieved using the -rf option with a rank file in OMPI, but it seems to me that the rank file doesn't work in a multiple instruction multiple data (MIMD) environment. That complication led me to try hwloc-bind.

FYI, my testing environment and application impose these requirements for optimum performance:

i. Different binaries optimized for heterogeneous machines. This necessitates MIMD, and can be done in OMPI using the -app option (providing an application context file).

ii. The application is communication-sensitive.
Thus, fine-grained process mapping on *machines* and on *cores* is required to minimize the inter-machine and inter-socket communication costs incurred on the network and on the system bus. Specifically, processes should be mapped onto successive cores of one socket before the next socket is considered, i.e., socket.0:core0-3, then socket.1:core0-3. In this case, communication among neighboring ranks 0-3 will be confined to socket 0 without going through the system bus; same for ranks 4-7 on socket 1. As such, the order of the cores should follow the *logical* indexes.

Initially, I tried combining the features of rankfile and appfile, e.g.,

$ cat rankfile8np4
rank 0=compute-0-8 slot=0:0
rank 1=compute-0-8 slot=0:1
rank 2=compute-0-8 slot=0:2
rank 3=compute-0-8 slot=0:3
$ cat rankfile9np4
rank 0=compute-0-9 slot=0:0
rank 1=compute-0-9 slot=0:1
rank 2=compute-0-9 slot=0:2
rank 3=compute-0-9 slot=0:3
$ cat my_appfile_rankfile
--host compute-0-8 -rf rankfile8np4 -np 4 ./test1
--host compute-0-9 -rf rankfile9np4 -np 4 ./test2
$ mpirun -app my_appfile_rankfile

but found out that only the rankfile stated on the first line took effect; the second was ignored completely. After some googling and trial and error, I decided to try an external binder, which led me to hwloc-bind.

Maybe I should bring the issue of rankfile + appfile to the OMPI mailing list.

2. I thought of invoking a script too, but am not sure how to start. Thanks for your info. I shall come back to you if I need further help.

Chan

--- On Mon, 2/14/11, Jeff Squyres wrote:

> From: Jeff Squyres
> Subject: Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
> To: "Hardware locality user list"
> Date: Monday, February 14, 2011, 7:26 AM
Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
On Feb 13, 2011, at 4:07 AM, Brice Goglin wrote:

>> $ mpirun -np 4 hwloc-bind socket:0.core:0-3 ./test
>>
>> 1. Does hwloc-bind map the processes *sequentially* on *successive* cores
>> of the socket?
>
> No. Each hwloc-bind command in the mpirun above doesn't know that there are
> other hwloc-bind instances on the same machine. All of them bind their
> process to all cores in the first socket.

To further underscore this point, mpirun launched 4 copies of:

    hwloc-bind socket:0.core:0-3 ./test

which means that all 4 processes bound to exactly the same thing. If you want each process to bind to a *different* set of PUs, then you have two choices:

1. See Open MPI 1.5.1's mpirun(1) man page. There are new affinity options in the OMPI 1.5 series, such as --bind-to-core and --bind-to-socket. We wrote them up in the FAQ, too.

2. Write a wrapper script that looks at one of the Open MPI environment variables OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_LOCAL_RANK, or OMPI_COMM_WORLD_NODE_RANK and decides how to invoke hwloc-bind. For example, something like this:

    mpirun -np 4 my_wrapper.sh ./test

where my_wrapper.sh is:

----------
#!/bin/sh
if test "$OMPI_COMM_WORLD_RANK" = "0"; then
    bind_string=...whatever...
else
    bind_string=...whatever...
fi
hwloc-bind $bind_string $*
----------

Something like that.

-- 
Jeff Squyres
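A concrete version of that wrapper sketch might look like the following. This is a hypothetical example, not from the original mail: the socket/core arithmetic assumes the 2-socket x 4-core layout discussed in this thread, and the actual hwloc-bind invocation is left commented out so the computed mapping can be inspected on any machine.

```shell
#!/bin/sh
# Hypothetical my_wrapper.sh: fill socket 0's cores with local ranks 0-3
# before moving to socket 1, matching the logical ordering the original
# poster wants. Assumes 2 sockets x 4 cores per socket, and that Open MPI
# exports OMPI_COMM_WORLD_LOCAL_RANK to each launched process.

bind_string_for_rank() {
    # ranks 0-3 -> socket:0.core:0-3, ranks 4-7 -> socket:1.core:0-3
    echo "socket:$(($1 / 4)).core:$(($1 % 4))"
}

rank=${OMPI_COMM_WORLD_LOCAL_RANK:-0}
bind_string=$(bind_string_for_rank "$rank")
echo "local rank $rank -> $bind_string"

# In the real wrapper, the last line would re-exec the program under the
# computed binding:
#   exec hwloc-bind "$bind_string" "$@"
```

It would then be launched as `mpirun -np 8 my_wrapper.sh ./test` (untested; using the local rank rather than the world rank keeps the mapping per-node when several nodes are involved).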
Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
Brice Goglin, on Mon 14 Feb 2011 07:56:56 +0100, wrote:

> The operating system decides where each process runs (according to the
> binding). It usually has no knowledge of MPI ranks. And I don't think it
> looks at the PID numbers during the scheduling.

It doesn't, indeed.

Samuel
Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
On 14/02/2011 07:43, Siew Yin Chan wrote:

>> No. Each hwloc-bind command in the mpirun above doesn't know that there
>> are other hwloc-bind instances on the same machine. All of them bind
>> their process to all cores in the first socket.
>
> => Agree. For socket:0.core:0-3, hwloc will bind the MPI processes to all
> cores in the first socket. But how are the individual processes mapped on
> these cores? Will it be in this order:
>
> rank 0 -> core 0
> rank 1 -> core 1
> rank 2 -> core 2
> rank 3 -> core 3
>
> Or in this *arbitrary* order:
>
> rank 0 -> core 1
> rank 1 -> core 3
> rank 2 -> core 0
> rank 3 -> core 2

The operating system decides where each process runs (according to the binding). It usually has no knowledge of MPI ranks. And I don't think it looks at the PID numbers during the scheduling. So it's very likely random.

Please distinguish your replies from the text you quote. Otherwise, it's hard to understand where your reply is. I hope I didn't miss anything.

Brice
Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
--- On Sun, 2/13/11, Brice Goglin wrote:

> From: Brice Goglin
> Subject: Re: [hwloc-users] hwloc-ps output - how to verify process binding on the core level?
> To: "Hardware locality user list"
> Date: Sunday, February 13, 2011, 3:07 AM
>
> On 13/02/2011 04:54, Siew Yin Chan wrote:
>
>> Good day,
>>
>> I'm studying the impact of MPI process binding on communication costs in
>> my project, and would like to use hwloc-bind to achieve fine-grained
>> mapping control. I installed hwloc 1.1.1 on a 2-socket 4-core machine
>> (with 2 dual-core dies in each socket), and ran hwloc-ps to verify the
>> binding:
>>
>> $ mpirun -V
>> mpirun (Open MPI) 1.5.1
>> $ mpirun -np 4 hwloc-bind socket:0.core:0-3 ./test
>>
>> hwloc-ps shows the following output:
>>
>> $ hwloc-ps -p
>> 1497 Socket:0 ./test
>> 1498 Socket:0 ./test
>> 1499 Socket:0 ./test
>> 1500 Socket:0 ./test
>> $ hwloc-ps -l
>> 1497 Socket:0 ./test
>> 1498 Socket:0 ./test
>> 1499 Socket:0 ./test
>> 1500 Socket:0 ./test
>> $ hwloc-ps -c
>> 1497 0x0055 ./test
>> 1498 0x0055 ./test
>> 1499 0x0055 ./test
>> 1500 0x0055 ./test
>>
>> Questions:
>>
>> 1. Does hwloc-bind map the processes *sequentially* on *successive* cores
>> of the socket?
>
> Hello,
>
> No. Each hwloc-bind command in the mpirun above doesn't know that there
> are other hwloc-bind instances on the same machine. All of them bind their
> process to all cores in the first socket.

=> Agree. For socket:0.core:0-3, hwloc will bind the MPI processes to all cores in the first socket. But how are the individual processes mapped on these cores? Will it be in this order:

rank 0 -> core 0
rank 1 -> core 1
rank 2 -> core 2
rank 3 -> core 3

Or in this *arbitrary* order:

rank 0 -> core 1
rank 1 -> core 3
rank 2 -> core 0
rank 3 -> core 2

>> 2. How could hwloc-ps help verify this binding, i.e.,
>>
>> process 1497 (rank 0) on socket.0:core.0
>> process 1498 (rank 1) on socket.0:core.1
>> process 1499 (rank 2) on socket.0:core.2
>> process 1500 (rank 3) on socket.0:core.3
>
> (let's assume your mpirun command did what you want)
>
> You would get something like this from hwloc-ps:
>
> 1497 Core:0 ./test
> 1498 Core:1 ./test
> 1499 Core:2 ./test
> 1500 Core:3 ./test
>
> These core numbers are the logical core numbers for the entire machine.
> hwloc-ps can't easily show a hierarchical location such as socket.core,
> since there are many possible combinations, especially because of caches.
> Actually, you might get L1Cache instead of Core above, since hwloc-ps
> reports the first object that exactly matches the process binding (and L1
> is above but equal to Core in your machine). If you want other output, I
> suggest you use hwloc-calc to convert the hwloc-ps output.
>
>> Equivalently, does the binding of `socket:0.core:0-1 socket:1.core:0-1',
>> with hwloc-ps showing
>>
>> $ hwloc-ps -l
>> 1315 L2Cache:0 L2Cache:2 ./test
>> 1316 L2Cache:0 L2Cache:2 ./test
>> 1317 L2Cache:0 L2Cache:2 ./test
>> 1318 L2Cache:0 L2Cache:2 ./test
>>
>> indicate the following? I.e.,
>>
>> process 1315 (rank 0) on socket.0:core.0
>> process 1316 (rank 1) on socket.0:core.1
>> process 1317 (rank 2) on socket.1:core.0
>> process 1318 (rank 3) on socket.1:core.1
>
> No. Again, all processes are bound to the same set of 4 cores, so hwloc-ps
> shows the largest objects containing those cores.
>
> In the end, you want an MPI launcher that takes care of the binding
> instead of having to bind manually on the command line. That should be the
> case with most MPI launchers nowadays. Once this is OK, hwloc-ps will
> report the exact core where you bound, and you might need to play with
> hwloc-calc to rewrite the hwloc-ps output as you want. I am thinking of
> adding an option to hwloc-calc to help rewrite a random string into
> socket:X.core:Y or something like that.
> Brice

___
hwloc-users mailing list
hwloc-us...@open-mpi.org
http://www.open-mpi.org/mailman/listinfo.cgi/hwloc-users
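As a footnote to Brice's point about converting hwloc-ps output: the 0x0055 cpuset mask that `hwloc-ps -c` printed earlier in the thread can also be decoded by hand, since each set bit is one physical PU index. A small pure-shell sketch (an editorial addition, not from the thread; with hwloc installed, hwloc-calc is the proper tool for such conversions):

```shell
#!/bin/sh
# Decode a hwloc cpuset mask (hex) into the physical PU indexes it covers.

pus_in_mask() {
    mask=$(($1))                 # accepts hex input such as 0x0055
    pu=0
    out=""
    while [ "$mask" -ne 0 ]; do
        if [ $((mask & 1)) -eq 1 ]; then
            out="${out:+$out }$pu"   # bit set: this PU is in the binding
        fi
        mask=$((mask >> 1))
        pu=$((pu + 1))
    done
    echo "$out"
}

pus_in_mask 0x0055   # prints "0 2 4 6"
```

On the machine discussed in this thread, PUs 0, 2, 4 and 6 are exactly the four cores of socket 0, which is consistent with Brice's explanation that all four processes were bound to the whole first socket.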