This isn't an hwloc problem exactly, but maybe you can shed some light. We have some 4-socket, 10-cores-per-socket (= 40 core) nodes, HT off:
  depth 0: 1 Machine (type #1)
   depth 1: 4 NUMANodes (type #2)
   depth 2: 4 Sockets (type #3)
   depth 3: 4 Caches (type #4)
   depth 4: 40 Caches (type #4)
   depth 5: 40 Caches (type #4)
   depth 6: 40 Cores (type #5)
   depth 7: 40 PUs (type #6)

We run RHEL 6.3 and use Torque to create cgroups for jobs. I get the following cgroup for this job; all 12 cores for the job are on one node:

  cat /dev/cpuset/torque/8845236.nyx.engin.umich.edu/cpus
  0-1,4-5,8,12,16,20,24,28,32,36

Not all nicely spaced, but 12 cores. I then start a code, even a simple serial code, with OpenMPI 1.6.0 on all 12 cores:

  mpirun ./stream

  45521 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
  45522 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.08 stream
  45525 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.72 stream
  45526 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.07 stream
  45527 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
  45528 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 4:02.71 stream
  45532 brockp 20 0 1885m 1.8g 456 R 100.0 0.2 1:46.05 stream
  45529 brockp 20 0 1885m 1.8g 456 R  99.2 0.2 4:02.70 stream
  45530 brockp 20 0 1885m 1.8g 456 R  99.2 0.2 4:02.70 stream
  45531 brockp 20 0 1885m 1.8g 456 R  33.6 0.2 1:20.89 stream
  45523 brockp 20 0 1885m 1.8g 456 R  32.8 0.2 1:20.90 stream
  45524 brockp 20 0 1885m 1.8g 456 R  32.8 0.2 1:20.89 stream

Note the three processes that are not running at 100% CPU. Checking their binding:

  hwloc-bind --get --pid 45523
  0x00000011,0x11111133

(the same mask is reported for all 12 processes)

  hwloc-calc 0x00000011,0x11111133 --intersect PU
  0,1,2,3,4,5,6,7,8,9,10,11

So all ranks in the job should see all 12 cores (a programmatic version of this check is in the PS below). The same cgroup is reported by /proc/<pid>/cgroup. Not only that, I can make things work by forcing binding in the MPI launcher:

  mpirun -bind-to-core ./stream

  46886 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
  46887 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
  46888 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
  46889 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.49 stream
  46890 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
  46891 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.48 stream
  46892 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
  46893 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
  46894 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
  46895 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.47 stream
  46896 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.46 stream
  46897 brockp 20 0 1885m 1.8g 456 R 99.8 0.2 0:15.46 stream

Things now work as expected, and I should stress that this is inside the same Torque job and cgroup I started with. A multithreaded version of the code also uses close to 12 cores, as expected. And if I circumvent our batch system and the cgroups entirely, a plain mpirun ./stream does start 12 processes that each consume a full 100% of a core.

Thoughts? This is really odd Linux scheduler behavior.

Brock Palen
www.umich.edu/~brockp
CAEN Advanced Computing
bro...@umich.edu
(734) 936-1985
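PS: in case it helps anyone reproduce this, here is a minimal C sketch of the same check I did with hwloc-bind --get --pid and hwloc-calc --intersect PU, done through the hwloc API. The file name and build line are just illustrative; it takes a PID as its argument and defaults to itself:

  /* check_bind.c -- roughly what `hwloc-bind --get --pid <pid>` followed by
   * `hwloc-calc <mask> --intersect PU` reports, via the hwloc C API.
   * Build (paths may vary): gcc check_bind.c -o check_bind -lhwloc
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <unistd.h>
  #include <hwloc.h>

  int main(int argc, char *argv[])
  {
      hwloc_topology_t topo;
      hwloc_bitmap_t set = hwloc_bitmap_alloc();
      pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : getpid();
      char *mask;
      int i, npus;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);  /* sees only what the cgroup/cpuset allows */

      /* current CPU binding of the target process */
      hwloc_get_proc_cpubind(topo, pid, set, 0);
      hwloc_bitmap_asprintf(&mask, set);
      printf("pid %ld bound to %s\n", (long)pid, mask);
      free(mask);

      /* list the PUs covered by that mask */
      npus = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
      for (i = 0; i < npus; i++) {
          hwloc_obj_t pu = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, i);
          if (hwloc_bitmap_isset(set, pu->os_index))
              printf("PU L#%u (P#%u)\n", pu->logical_index, pu->os_index);
      }

      hwloc_bitmap_free(set);
      hwloc_topology_destroy(topo);
      return 0;
  }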
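And the flip side: a sketch of roughly what -bind-to-core ends up arranging, expressed with hwloc from inside each process. I'm assuming the rank is available in OMPI_COMM_WORLD_RANK (Open MPI's mpirun exports this); substitute whatever your launcher provides if not:

  /* self_bind.c -- bind the calling process to one core, picked by rank,
   * similar in spirit to what mpirun -bind-to-core does for each rank.
   */
  #include <stdio.h>
  #include <stdlib.h>
  #include <hwloc.h>

  int main(void)
  {
      hwloc_topology_t topo;
      char *env = getenv("OMPI_COMM_WORLD_RANK");
      int rank = env ? atoi(env) : 0;
      int ncores;
      hwloc_obj_t core;

      hwloc_topology_init(&topo);
      hwloc_topology_load(topo);  /* already clipped to the job's cpuset */

      /* pick the rank-th core among those the cgroup allows */
      ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
      core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, rank % ncores);

      /* bind this process (and any threads it spawns) to that core */
      if (hwloc_set_cpubind(topo, core->cpuset, HWLOC_CPUBIND_PROCESS) < 0)
          perror("hwloc_set_cpubind");

      /* ... the real work (e.g. the stream kernel) would run here ... */
      hwloc_topology_destroy(topo);
      return 0;
  }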
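Finally, since the question is really whether the kernel scheduler honors the mask, here's a way to ask the kernel directly and bypass hwloc entirely (this is the same information as Cpus_allowed_list in /proc/<pid>/status):

  /* kernel_mask.c -- print the affinity mask the kernel scheduler itself
   * holds for a PID (0 = the calling process), via sched_getaffinity(2).
   */
  #define _GNU_SOURCE
  #include <stdio.h>
  #include <stdlib.h>
  #include <sched.h>
  #include <sys/types.h>

  int main(int argc, char *argv[])
  {
      pid_t pid = (argc > 1) ? (pid_t)atoi(argv[1]) : 0;
      cpu_set_t set;
      int cpu;

      if (sched_getaffinity(pid, sizeof(set), &set) != 0) {
          perror("sched_getaffinity");
          return 1;
      }
      for (cpu = 0; cpu < CPU_SETSIZE; cpu++)
          if (CPU_ISSET(cpu, &set))
              printf("%d ", cpu);
      printf("\n");
      return 0;
  }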