On Feb 1, 2013, at 9:59 PM, "Barrett, Brian W" <bwba...@sandia.gov> wrote:

> I don't think this is right either. Excluding a device that doesn't exist has 
> many use cases, such as disabling a network that only exists on part of the 
> cluster.  I'm not sure about what to do with seq; it's more like include than 
> exclude.

Hmm.  I've now given this quite a bit of thought.  Here's what I think:

1. Just like there might be good reasons to exclude non-existent interfaces 
(e.g., networks that only exist on part of the cluster), the same argument 
could be made for *including* non-existent interfaces.

2. It seems odd to me to have different behavior for non-existent interfaces 
between include, exclude, and/or seq.

3. We have a very strong precedent throughout OMPI that if a human asks for 
something that OMPI can't deliver, OMPI should error.  According to this, and 
according to the Law of Least Surprise, I would think that if I typo an exclude 
interface name, OMPI should error and make a human figure it out.

4. If someone wants different includes/excludes in different parts of the 
cluster, then they should have per-node values for these MCA params.

5. That being said, #4 is not always feasible.  Concrete example (which is why 
this whole thing started, incidentally): in my MTT cluster at Cisco, I have 
*some* nodes with back-to-back interfaces.  I can't think of a good way to have 
per-node MCA params in an MTT run that is SLURM-queued and may end up on random 
nodes in my cluster -- that may or may not include nodes with loopback 
interfaces.

So how about this compromise:

If an invalid include, exclude, or if_seq interface is specified:
- If that interface is prefixed with "nowarn:", silently ignore that token
- Otherwise, display a show_help message and have the TCP BTL disqualify itself

For example:

    mpirun --mca btl_tcp_if_include nowarn:eth5,eth6

- If eth5 doesn't exist, the job will continue just as if eth5 wasn't specified
- If eth6 doesn't exist, the TCP BTL will disqualify itself
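
To make the proposed semantics concrete, here is a rough sketch of the token handling in Python. This is not OMPI code: the function names (split_if_tokens, tcp_btl_usable) and the idea of passing in a set of existing interface names are made up for illustration, and the real implementation would live in C inside the TCP BTL and call show_help instead of returning a list.

```python
NOWARN_PREFIX = "nowarn:"


def split_if_tokens(spec, existing):
    """Apply the proposed "nowarn:" rule to a comma-separated interface list.

    spec     -- value of an if_include / if_exclude / if_seq style parameter
    existing -- set of interface names present on this node (hypothetical input)

    Returns (matched, errors):
      matched -- interfaces from spec that exist on this node
      errors  -- missing interfaces that were NOT marked "nowarn:"
    """
    matched = []
    errors = []
    for token in spec.split(","):
        token = token.strip()
        if not token:
            continue
        nowarn = token.startswith(NOWARN_PREFIX)
        name = token[len(NOWARN_PREFIX):] if nowarn else token
        if name in existing:
            matched.append(name)
        elif not nowarn:
            # Missing and not marked "nowarn:": this should be reported.
            errors.append(name)
        # Missing but marked "nowarn:": silently ignore the token.
    return matched, errors


def tcp_btl_usable(spec, existing):
    """Under the proposal, the BTL disqualifies itself if any error remains."""
    _, errors = split_if_tokens(spec, existing)
    return not errors
```

With the example above, a node that has eth6 but not eth5 gets matched == ["eth6"] and no errors, so the job continues.  A node that has eth5 but not eth6 gets errors == ["eth6"], which is the case where the TCP BTL would disqualify itself.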

(BTW: yes, I'm volunteering to code up whatever we agree on)

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/