There appears to have been a change in the behaviour of -npersocket from
1.4.3 to 1.6.x (tested with 1.6.2). Below is what I see on a pair of
dual-socket, quad-core Nehalem nodes running under PBS; with two nodes and
two sockets per node, -npersocket 3 should account for exactly the 12
processes requested. Is this expected?

Thanks
David


[dbs900@v482 ~/MPI]$ mpirun -V
mpirun (Open MPI) 1.4.3
...
[dbs900@v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa143
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],0] to socket 0 cpus 0001
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],1] to socket 0 cpus 0002
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],2] to socket 0 cpus 0004
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],3] to socket 1 cpus 0010
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],4] to socket 1 cpus 0020
[v482:03367] [[64945,0],0] odls:default:fork binding child [[64945,1],5] to socket 1 cpus 0040
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],6] to socket 0 cpus 0001
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],7] to socket 0 cpus 0002
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],8] to socket 0 cpus 0004
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],9] to socket 1 cpus 0010
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],10] to socket 1 cpus 0020
[v483:31768] [[64945,0],1] odls:default:fork binding child [[64945,1],11] to socket 1 cpus 0040
...

[dbs900@v482 ~/MPI]$ mpirun -V
mpirun (Open MPI) 1.6.2
...
[dbs900@v482 ~/MPI]$ mpirun --report-bindings -npersocket 3 -np 12 ./numa162
--------------------------------------------------------------------------
Your job has requested a conflicting number of processes for the
application:

App: ./numa162
number of procs:  12

This is more processes than we can launch under the following
additional directives and conditions:

number of sockets:   0
npersocket:   3

Please revise the conflict and try again.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A daemon (pid unknown) died unexpectedly on signal 1  while attempting to
launch so we are aborting.

There may be more information reported by the environment (see above).

This may be because the daemon was unable to find all the needed shared
libraries on the remote node. You may set your LD_LIBRARY_PATH to have the
location of the shared libraries on the remote nodes and this will
automatically be forwarded to the remote nodes.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that the job aborted, but has no info as to the process
that caused that situation.
--------------------------------------------------------------------------
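
For reference, the numa143/numa162 binary is nothing special; it is roughly
the sketch below (the actual source isn't attached, so take this as an
approximation): each rank just reports which host and CPU it lands on.

/* trivial binding check: print host and current CPU for each rank */
#define _GNU_SOURCE
#include <stdio.h>
#include <sched.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* sched_getcpu() reports the core the calling process is running on */
    printf("rank %d of %d on %s, cpu %d\n", rank, size, host, sched_getcpu());

    MPI_Finalize();
    return 0;
}

It is built with mpicc from the corresponding Open MPI install and run
exactly as shown above.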
