Re: [hwloc-users] Strange binding issue on 40 core nodes and cgroups
On 02/11/2012 21:22, Brice Goglin wrote:
> hwloc-bind --get-last-cpu-location --pid should give the same
> info but it seems broken on my machine right now, going to debug.

Actually, it works fine once you try it on a non-multithreaded program that
uses all the cores :) So you can use either top or
hwloc-bind --get-last-cpu-location --pid to find out where each process runs.

Brice
Re: [hwloc-users] Strange binding issue on 40 core nodes and cgroups
On 02/11/2012 21:03, Brock Palen wrote:
> This isn't a hwloc problem exactly, but maybe you can shed some insight.
>
> We have some 4 socket x 10 core = 40 core nodes, HT off:
>
> depth 0: 1 Machine (type #1)
> depth 1: 4 NUMANodes (type #2)
> depth 2: 4 Sockets (type #3)
> depth 3: 4 Caches (type #4)
> depth 4: 40 Caches (type #4)
> depth 5: 40 Caches (type #4)
> depth 6: 40 Cores (type #5)
> depth 7: 40 PUs (type #6)
>
> We run RHEL 6.3 and use Torque to create cgroups for jobs. I get the
> following cgroup for this job; all 12 cores for the job are on one node:
>
> cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/cpus
> 0-1,4-5,8,12,16,20,24,28,32,36
>
> Not all nicely spaced, but 12 cores.
>
> I then start a code, even a simple serial code, with Open MPI 1.6.0 on
> all 12 cores:
>
> mpirun ./stream
>
> 45521 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
> 45522 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.08 stream
> 45525 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
> 45526 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.07 stream
> 45527 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
> 45528 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
> 45532 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.05 stream
> 45529 brockp 20 0 1885m 1.8g 456 R 99.2 0.2 4:02.70 stream
> 45530 brockp 20 0 1885m 1.8g 456 R 99.2 0.2 4:02.70 stream
> 45531 brockp 20 0 1885m 1.8g 456 R 33.6 0.2 1:20.89 stream
> 45523 brockp 20 0 1885m 1.8g 456 R 32.8 0.2 1:20.90 stream
> 45524 brockp 20 0 1885m 1.8g 456 R 32.8 0.2 1:20.89 stream
>
> Note the processes that are not running at 100% CPU.
>
> hwloc-bind --get --pid 45523
> 0x0011,0x1133

Hello Brock,

I don't see anything obviously helpful to answer here :/

Do you know which core is overloaded and which (two?) cores are idle? Does
that change during one run, or from one run to another? Pressing 1 in top
should give that information in the very first lines.
Then, you can try binding another process to one of the idle cores, to see
whether the kernel accepts that.

You can also press "f" and "j" (or "f", then use the arrows and space to
select "last used cpu") to add a "P" column, which shows the last CPU used
by each process. hwloc-bind --get-last-cpu-location --pid should give the
same info, but it seems broken on my machine right now, going to debug.

One thing to check would be to run more than 12 processes and see where the
kernel puts them. If it keeps ignoring the same two cores, that would be
funny :)

Brice
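[Editor's note: a small sketch, not from the thread, of what top's "P"
column and hwloc-bind --get-last-cpu-location read. On Linux, field 39 of
/proc/<pid>/stat is the number of the CPU the task last ran on; the
function name last_cpu is my own.]

```python
def last_cpu(pid):
    """Return the CPU a task last ran on, per /proc/<pid>/stat."""
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # Field 2 (the command name) may itself contain spaces, so split
    # only the text after its closing parenthesis.
    fields = stat[stat.rindex(")") + 2:].split()
    # fields[0] is field 3 ("state"), so field 39 is fields[36].
    return int(fields[36])

if __name__ == "__main__":
    import os
    print(last_cpu(os.getpid()))  # CPU this script last ran on
```

Polling this per rank during a run would show whether several ranks are
stuck sharing one CPU while others sit idle.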
[hwloc-users] Strange binding issue on 40 core nodes and cgroups
This isn't a hwloc problem exactly, but maybe you can shed some insight.

We have some 4 socket x 10 core = 40 core nodes, HT off:

depth 0: 1 Machine (type #1)
depth 1: 4 NUMANodes (type #2)
depth 2: 4 Sockets (type #3)
depth 3: 4 Caches (type #4)
depth 4: 40 Caches (type #4)
depth 5: 40 Caches (type #4)
depth 6: 40 Cores (type #5)
depth 7: 40 PUs (type #6)

We run RHEL 6.3 and use Torque to create cgroups for jobs. I get the
following cgroup for this job; all 12 cores for the job are on one node:

cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/cpus
0-1,4-5,8,12,16,20,24,28,32,36

Not all nicely spaced, but 12 cores.

I then start a code, even a simple serial code, with Open MPI 1.6.0 on all
12 cores:

mpirun ./stream

45521 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
45522 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.08 stream
45525 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
45526 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.07 stream
45527 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
45528 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
45532 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.05 stream
45529 brockp 20 0 1885m 1.8g 456 R 99.2 0.2 4:02.70 stream
45530 brockp 20 0 1885m 1.8g 456 R 99.2 0.2 4:02.70 stream
45531 brockp 20 0 1885m 1.8g 456 R 33.6 0.2 1:20.89 stream
45523 brockp 20 0 1885m 1.8g 456 R 32.8 0.2 1:20.90 stream
45524 brockp 20 0 1885m 1.8g 456 R 32.8 0.2 1:20.89 stream

Note the processes that are not running at 100% CPU.

hwloc-bind --get --pid 45523
0x0011,0x1133

hwloc-calc 0x0011,0x1133 --intersect PU
0,1,2,3,4,5,6,7,8,9,10,11

So all ranks in the job should see all 12 cores.
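[Editor's note: hwloc prints cpusets as comma-separated 32-bit hex words,
most significant word first; hwloc-calc above maps the mask to logical PU
indices in the (cgroup-restricted) topology. As a cross-check, here is a
small decoder, not from the thread, that lists the raw bit positions set in
such a mask string.]

```python
def decode_cpuset(mask):
    """Decode a hwloc cpuset string ("0x0011,0x1133") into set bit indices."""
    value = 0
    for word in mask.split(","):        # most significant 32-bit word first
        value = (value << 32) | int(word, 16)
    return [bit for bit in range(value.bit_length()) if value >> bit & 1]

print(decode_cpuset("0x0011,0x1133"))
# -> [0, 1, 4, 5, 8, 12, 32, 36]
```

Comparing the decoded bits against the cgroup's cpuset list is a quick way
to see exactly which physical PUs a given binding mask covers.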
The same cgroup is reported by /proc/<pid>/cgroup.

Not only that, I can make things work by forcing binding in the MPI
launcher:

mpirun -bind-to-core ./stream

46886 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
46887 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
46888 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
46889 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
46890 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
46891 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
46892 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46893 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46894 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46895 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
46896 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.46 stream
46897 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.46 stream

Things are now working as expected, and I should stress that this is inside
the same Torque job and cgroup that I started with. A multithreaded version
of the code does use close to 12 cores, as expected. If I circumvent our
batch system and the cgroups, a plain mpirun ./stream does start 12
processes that each consume a full 100% core.

Thoughts? This is really odd Linux scheduler behavior.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734)936-1985
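[Editor's note: a sketch, not from the thread, of checking from inside a
process which CPUs the kernel will actually schedule it on; this is the
same information "hwloc-bind --get --pid" reports. os.sched_getaffinity
wraps the Linux sched_getaffinity(2) call.]

```python
import os

# CPUs the calling process (pid 0 = self) is allowed to run on.
allowed = os.sched_getaffinity(0)
print(sorted(allowed))

# A rank launched with "mpirun -bind-to-core" would see a single CPU here,
# while an unbound rank sees the whole cgroup cpuset.
```

Printing this from each MPI rank at startup makes it easy to tell whether
the launcher bound the ranks or left them free inside the cpuset.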