I believe the issue is the default setting for the number of sockets on a node. In the 1.7 series and beyond we discover the socket count automatically, but in the 1.6 series the default value got set to zero (in 1.4 I believe it defaults to 1).
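As a rough illustration of why a socket count of zero trips the "more processes than we can launch" error, here is the arithmetic using the numbers from the report (2 nodes, -npersocket 3, -np 12). This is only a sketch of the reasoning, not Open MPI's actual mapper code; the function names are made up for the example, and the assumption is that the limit is simply nodes x sockets x npersocket.

```python
def max_launchable(num_nodes, sockets_per_node, npersocket):
    """Upper bound on processes if each socket may host at most npersocket.

    Assumed model: limit = nodes * sockets per node * procs per socket.
    """
    return num_nodes * sockets_per_node * npersocket


def fits(np_requested, num_nodes, sockets_per_node, npersocket):
    """True if the requested process count stays within that bound."""
    return np_requested <= max_launchable(num_nodes, sockets_per_node, npersocket)


# Numbers from the report: 2 nodes, -npersocket 3, -np 12.
# Compare a socket count of 0 (the 1.6 default) with 2 (the real hardware).
for sockets in (0, 2):
    limit = max_launchable(2, sockets, 3)
    ok = fits(12, 2, sockets, 3)
    print(f"sockets per node = {sockets}: limit = {limit}, -np 12 fits: {ok}")
```

With zero sockets the bound collapses to zero, so any -np value conflicts; with the real socket count of 2 per node, the 12 requested processes fit exactly.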
Try adding "-mca orte_num_sockets N -mca orte_num_cores M" to your command line, where N is the number of sockets on each node and M is the number of cores on each socket. For the dual-socket, quad-core nodes you describe, that would be N=2 and M=4.

On Nov 7, 2012, at 1:32 PM, David Singleton <david.single...@anu.edu.au> wrote:

>
> There appears to have been a change in the behaviour of -npersocket from
> 1.4.3 to 1.6.x (tested with 1.6.2). Below is what I see on a pair of dual
> quad-core socket Nehalem nodes running under PBS. Is this expected?
>
> Thanks
> David
>
>
> [dbs900@v482 ~/MPI]$ mpirun -V
> mpirun (Open MPI) 1.4.3
> ...
> [dbs900@v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa143
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],0] to socket 0 cpus 0001
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],1] to socket 0 cpus 0002
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],2] to socket 0 cpus 0004
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],3] to socket 1 cpus 0010
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],4] to socket 1 cpus 0020
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],5] to socket 1 cpus 0040
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],6] to socket 0 cpus 0001
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],7] to socket 0 cpus 0002
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],8] to socket 0 cpus 0004
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],9] to socket 1 cpus 0010
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],10] to socket 1 cpus 0020
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],11] to socket 1 cpus 0040
> ...
>
> [dbs900@v482 ~/MPI]$ mpirun -V
> mpirun (Open MPI) 1.6.2
> ...
> [dbs900@v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa162
> --------------------------------------------------------------------------
> Your job has requested a conflicting number of processes for the
> application:
>
> App: ./numa162
> number of procs: 12
>
> This is more processes than we can launch under the following
> additional directives and conditions:
>
> number of sockets: 0
> npersocket: 3
>
> Please revise the conflict and try again.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1 while attempting to
> launch so we are aborting.
>
> There may be more information reported by the environment (see above).
>
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel