There appears to have been a change in the behaviour of -npersocket between 1.4.3 and 1.6.x (tested with 1.6.2). Below is what I see on a pair of dual-socket, quad-core Nehalem nodes running under PBS. Is this expected?
Thanks,
David

[dbs900@v482 ~/MPI]$ mpirun -V
mpirun (Open MPI) 1.4.3
...
[dbs900@v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa143
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],0] to socket 0 cpus 0001
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],1] to socket 0 cpus 0002
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],2] to socket 0 cpus 0004
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],3] to socket 1 cpus 0010
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],4] to socket 1 cpus 0020
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],5] to socket 1 cpus 0040
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],6] to socket 0 cpus 0001
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],7] to socket 0 cpus 0002
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],8] to socket 0 cpus 0004
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],9] to socket 1 cpus 0010
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],10] to socket 1 cpus 0020
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],11] to socket 1 cpus 0040
...

[dbs900@v482 ~/MPI]$ mpirun -V
mpirun (Open MPI) 1.6.2
...
[dbs900@v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa162
--------------------------------------------------------------------------
Your job has requested a conflicting number of processes for the application:

App: ./numa162
number of procs: 12

This is more processes than we can launch under the following additional
directives and conditions:

number of sockets: 0
npersocket: 3

Please revise the conflict and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
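
For reference, the sources of ./numa143 and ./numa162 were not included above, so here is a minimal sketch of the kind of test program they are assumed to be: each rank prints its hostname and the CPUs it is bound to, so the bindings reported by mpirun can be cross-checked from inside the application. This is only a stand-in for reproducing the behaviour, not the actual test code.

/* bindcheck.c - hypothetical stand-in for numa143/numa162.
 * Build:  mpicc -o bindcheck bindcheck.c
 * Run:    mpirun --report-bindings -npersocket 3 -np 12 ./bindcheck
 */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <string.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, hostlen;
    char host[MPI_MAX_PROCESSOR_NAME];
    cpu_set_t mask;
    char cpus[256] = "";

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &hostlen);

    /* Query the Linux affinity mask that mpirun's binding set for this rank. */
    CPU_ZERO(&mask);
    if (sched_getaffinity(0, sizeof(mask), &mask) == 0) {
        for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
            if (CPU_ISSET(cpu, &mask)) {
                char buf[16];
                snprintf(buf, sizeof(buf), "%d ", cpu);
                strncat(cpus, buf, sizeof(cpus) - strlen(cpus) - 1);
            }
        }
    }

    printf("rank %d of %d on %s bound to cpus: %s\n", rank, size, host, cpus);

    MPI_Finalize();
    return 0;
}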