This may be related to Reuti's "-hostfile ignored in 1.6.1" thread on the users mailing list, but I'm not sure.

Let's pretend my nodes are called local, r1, and r2. That is, I launch mpirun from "local" and there are two other (remote) nodes available to me. With the trunk (e.g., v1.9 r27247), I get

% mpirun --bynode --nooversubscribe --host r1,r1,r1,r2,r2,r2 -n 6 --tag-output hostname
    [1,0]<stdout>:r1
    [1,1]<stdout>:r2
    [1,2]<stdout>:r1
    [1,3]<stdout>:r2
    [1,4]<stdout>:r1
    [1,5]<stdout>:r2

which seems right to me.  But when the local node is involved:

% mpirun --bynode --nooversubscribe --host local,local,local,r1,r1,r1 -np 4 --tag-output hostname
    [1,0]<stdout>:local
    [1,1]<stdout>:r1
    [1,2]<stdout>:r1
    [1,3]<stdout>:r1
% mpirun --bynode --nooversubscribe --host local,local,local,r1,r1,r1 -np 5 --tag-output hostname

--------------------------------------------------------------------------
There are not enough slots available in the system to satisfy the 5 slots
that were requested by the application:
      hostname

Either request fewer slots for your application, or make more slots available
for use.

--------------------------------------------------------------------------

Of the three local slots I requested, only one seems to be available. We're seeing wide-scale MTT trunk failures due to this problem.
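
To make the expectation concrete, here is a rough Python sketch of the placement I would expect from --bynode with --nooversubscribe. It is only a model: it assumes each repetition of a name in the --host list counts as one slot and that ranks are dealt out round-robin over the nodes in the order they first appear. The function name expected_bynode is made up for illustration, and none of this is the actual ORTE mapper code.

```python
from collections import Counter

def expected_bynode(hosts, np):
    """Model of by-node (round-robin) placement without oversubscription.

    hosts: the --host list; each repetition of a name is assumed to be one slot.
    np:    number of processes requested.
    Hypothetical helper for illustration only, not Open MPI's mapper.
    """
    slots = Counter(hosts)              # slots available per node
    order = list(dict.fromkeys(hosts))  # nodes in first-appearance order
    if np > sum(slots.values()):
        raise ValueError("not enough slots for %d processes" % np)
    placement = []
    while len(placement) < np:
        # One pass over the nodes, giving one rank to each node with a free slot.
        for node in order:
            if slots[node] > 0 and len(placement) < np:
                slots[node] -= 1
                placement.append(node)
    return placement

print(expected_bynode(["local"] * 3 + ["r1"] * 3, 4))
print(expected_bynode(["local"] * 3 + ["r1"] * 3, 5))
```

Under those assumptions, the -np 4 case would alternate between local and r1 rather than put three ranks on r1, and the -np 5 case would fit within the six requested slots instead of failing. The same model reproduces the r1/r2 output shown earlier.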

There is a similar loss of local slots with hostfile syntax.  E.g.,

    % hostname
    local
    % cat               hostfile
    local
    r1
    % mpirun --hostfile hostfile -n 2 hostname

--------------------------------------------------------------------------
    A hostfile was provided that contains at least one node not
    present in the allocation:

      hostfile:  hostfile
      node:      local

    If you are operating in a resource-managed environment, then only
    nodes that are in the allocation can be used in the hostfile. You
    may find relative node syntax to be a useful alternative to
    specifying absolute node names see the orte_hosts man page for
    further information.


--------------------------------------------------------------------------

The problem goes away with "--mca orte_default_hostfile hostfile".
