I believe the problem is the default value for the number of sockets on a
node. In the 1.7 series and later we discover it automatically, but in the 1.6
series the default got set to zero (I believe it defaults to 1 in 1.4).

Try adding "-mca orte_num_sockets N -mca orte_num_cores M" to your command
line, where N is the number of sockets on each node and M is the number of
cores on each socket.
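
For the nodes in the report below (two sockets per node, four cores per socket, going by the 1.4.3 binding output), the command line would look roughly like the sketch here. The variable names are mine and purely illustrative, and the script only prints the command instead of running it. The process-count arithmetic is my reading of the error message, not something I have checked against the mapper code.

```shell
#!/bin/sh
# Hypothetical helper variables; values assumed from the quoted report.
NUM_NODES=2          # nodes in the PBS allocation (v482, v483)
NUM_SOCKETS=2        # N: sockets per node
CORES_PER_SOCKET=4   # M: cores per socket
NPERSOCKET=3         # value passed to -npersocket

# Assumed capacity: nodes * sockets * npersocket. With the socket count
# defaulting to 0 this would be 0, which matches the reported conflict.
MAX_PROCS=$((NUM_NODES * NUM_SOCKETS * NPERSOCKET))

# Build the command; echo it so it can be inspected before being run.
CMD="mpirun -mca orte_num_sockets $NUM_SOCKETS -mca orte_num_cores $CORES_PER_SOCKET --report-bindings -npersocket $NPERSOCKET -np $MAX_PROCS ./numa162"
echo "$CMD"
```

If the printed line looks right for your nodes, run it directly and check the --report-bindings output against what 1.4.3 gave you.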


On Nov 7, 2012, at 1:32 PM, David Singleton <david.single...@anu.edu.au> wrote:

> 
> There appears to have been a change in the behaviour of -npersocket from
> 1.4.3 to 1.6.x (tested with 1.6.2). Below is what I see on a pair of
> dual-socket, quad-core Nehalem nodes running under PBS.  Is this expected?
> 
> Thanks
> David
> 
> 
> [dbs900@v482 ~/MPI]$ mpirun -V
> mpirun (Open MPI) 1.4.3
> ...
> [dbs900@v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa143
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],0] to socket 0 cpus 0001
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],1] to socket 0 cpus 0002
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],2] to socket 0 cpus 0004
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],3] to socket 1 cpus 0010
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],4] to socket 1 cpus 0020
> [v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],5] to socket 1 cpus 0040
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],6] to socket 0 cpus 0001
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],7] to socket 0 cpus 0002
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],8] to socket 0 cpus 0004
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],9] to socket 1 cpus 0010
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],10] to socket 1 cpus 0020
> [v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],11] to socket 1 cpus 0040
> ...
> 
> [dbs900@v482 ~/MPI]$ mpirun -V
> mpirun (Open MPI) 1.6.2
> ...
> [dbs900@v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa162
> --------------------------------------------------------------------------
> Your job has requested a conflicting number of processes for the
> application:
> 
> App: ./numa162
> number of procs:  12
> 
> This is more processes than we can launch under the following
> additional directives and conditions:
> 
> number of sockets:   0
> npersocket:   3
> 
> Please revise the conflict and try again.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
> launch so we are aborting.
> 
> There may be more information reported by the environment (see above).
> 
> This may be because the daemon was unable to find all the needed shared
> libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
> location of the shared libraries on the remote nodes and this will
> automatically be forwarded to the remote nodes.
> --------------------------------------------------------------------------
> --------------------------------------------------------------------------
> mpirun noticed that the job aborted, but has no info as to the process
> that caused that situation.
> --------------------------------------------------------------------------
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

