REVISION 2 (based on feedback in last 24 hours).

Changes:

- NETWORK instead of NETWORK_TYPE
- Shared memory and process loopback are not affected by this CLI
- Change the OPAL API usage.

I actually like points 1-8 below quite a bit.  If implemented in ALL 
BTLs/MTLs/etc., it can solve the "how do I disable XYZ across all of Open MPI?" 
problem nicely.

Point 9 -- what does QUALIFIER mean/how is it used? -- still needs work (no 
real updates since rev 1 of this proposal).  I am thinking that QUALIFIER 
(somehow) can be used to figure out which OMPI code path to use for a given 
network (e.g., BTL vs. MTL, etc.).

-----

  mpirun --[enable|disable] NETWORK[:QUALIFIER][,NETWORK[:QUALIFIER]]*
  # Or "--[net|nonet]", or some other name if "enable|disable" is too general.
  # Suggestions welcome.
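
As a rough illustration of the syntax above, here is a minimal C sketch that splits the option argument into NETWORK / QUALIFIER pairs. The names net_spec_t and parse_network_list are hypothetical and not part of any existing OMPI/OPAL API. The sketch treats everything after the first ':' in an item as the QUALIFIER; it does not attempt to resolve the overlap with device:port forms such as "mlx4_0:1", which the proposal leaves open.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for one NETWORK[:QUALIFIER] item. */
typedef struct {
    char network[64];
    char qualifier[64];   /* empty string when no qualifier was given */
} net_spec_t;

/*
 * Split a comma-delimited list into at most max_specs entries.
 * Returns the number of entries parsed, or -1 if the input is empty,
 * contains an empty item or empty NETWORK, or has too many items.
 * Over-long names are truncated to fit the fixed-size buffers.
 */
static int parse_network_list(const char *arg, net_spec_t *specs, int max_specs)
{
    int count = 0;
    const char *p = arg;

    if (arg == NULL || *arg == '\0') {
        return -1;
    }

    for (;;) {
        size_t item_len = strcspn(p, ",");
        const char *colon;
        size_t net_len, qual_len;

        if (item_len == 0 || count >= max_specs) {
            return -1;
        }

        /* Assumption: the first ':' separates NETWORK from QUALIFIER. */
        colon = memchr(p, ':', item_len);
        net_len = colon ? (size_t)(colon - p) : item_len;
        qual_len = colon ? item_len - net_len - 1 : 0;
        if (net_len == 0) {
            return -1;
        }

        snprintf(specs[count].network, sizeof specs[count].network,
                 "%.*s", (int)net_len, p);
        snprintf(specs[count].qualifier, sizeof specs[count].qualifier,
                 "%.*s", (int)qual_len, colon ? colon + 1 : "");
        count++;

        if (p[item_len] == '\0') {
            break;
        }
        p += item_len + 1;   /* skip the comma */
    }
    return count;
}
```

With this sketch, "ib:btl,eth0,10.20.0.0/16" yields three entries, and only the first has a non-empty qualifier.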

1. The intent of these CLI options is to easily enable/disable specific network 
types and/or specific interfaces.

2. The use of shared memory and process loopback is assumed (and is not 
affected by these CLI options -- the "expert" level must be used if specific 
control over shared memory / loopback is desired).

3. Both forms take a comma-delimited list of 1 or more items.

4. --enable would work similarly to our "include" MCA params: OMPI will *only* 
use the network type(s) listed (but will still use shared memory and process 
loopback).

5. --disable would work similarly to our "exclude" MCA params: OMPI will use all 
network types *except* those listed (but will still use shared memory and 
process loopback).

6. NETWORK values can generally be one of three things:

   - a human-recognizable name (e.g., "ib", "ethernet", etc.)
   - a Linux interface device name (e.g., "eth0", "usnic_0", "mlx4_0",
     optionally specifying a specific port if desired and relevant, such as
     "mlx4_0:1")
   - a network address (e.g., "10.20.0.0/16", which specifies a specific
     network interface+port)
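
For the network-address form, a component would need some way to test whether one of its interfaces falls inside the given CIDR range. The following is only a sketch of one way to do that for IPv4, using the standard inet_pton() call; the function name ipv4_in_cidr is hypothetical, and IPv6 is not handled.

```c
#include <arpa/inet.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>

/*
 * Return true if the IPv4 address `addr` lies inside `cidr`
 * (written as "a.b.c.d/len").  Malformed input returns false.
 */
static bool ipv4_in_cidr(const char *addr, const char *cidr)
{
    char net_buf[INET_ADDRSTRLEN];
    const char *slash = strchr(cidr, '/');
    struct in_addr a, n;
    char *end;
    long prefix;
    uint32_t mask;
    size_t net_len;

    if (slash == NULL) {
        return false;
    }
    net_len = (size_t)(slash - cidr);
    if (net_len == 0 || net_len >= sizeof net_buf) {
        return false;
    }
    memcpy(net_buf, cidr, net_len);
    net_buf[net_len] = '\0';

    /* The prefix length must be a plain decimal number in 0..32. */
    prefix = strtol(slash + 1, &end, 10);
    if (end == slash + 1 || *end != '\0' || prefix < 0 || prefix > 32) {
        return false;
    }

    if (inet_pton(AF_INET, addr, &a) != 1 ||
        inet_pton(AF_INET, net_buf, &n) != 1) {
        return false;
    }

    /* Compare only the leading `prefix` bits, in host byte order. */
    mask = (prefix == 0) ? 0 : (0xFFFFFFFFu << (32 - prefix));
    return (ntohl(a.s_addr) & mask) == (ntohl(n.s_addr) & mask);
}
```

Under these assumptions, an interface with address 10.20.3.7 would match the value "10.20.0.0/16", while 10.21.3.7 would not.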

7. NETWORK and QUALIFIER values are parsed (by orterun/etc.) and distributed to 
MPI processes.

8. MPI processes can query the NETWORK values during BTL/MTL/etc. 
initialization and selection.

It may be sufficient to have a simple "did the user specify this NETWORK 
value?" (case insensitive) query function that just returns a boolean.
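
A minimal sketch of such a query function might look like the following. It is not the actual OPAL API: network_list_t and network_value_listed are hypothetical names, and the list is assumed to have already been delivered to the MPI process by the launcher. strcasecmp() (POSIX, from <strings.h>) provides the case-insensitive comparison.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical stand-in for the NETWORK values given on the command line. */
typedef struct {
    const char **values;
    size_t count;
} network_list_t;

/* Return true if `name` matches any listed value, ignoring case. */
static bool network_value_listed(const network_list_t *list, const char *name)
{
    size_t i;

    if (list == NULL || name == NULL) {
        return false;
    }
    for (i = 0; i < list->count; i++) {
        if (strcasecmp(list->values[i], name) == 0) {
            return true;
        }
    }
    return false;
}
```

A plain string comparison like this only covers the "name" and "device name" forms; matching an interface against a CIDR value would need an extra address check on the caller's side.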

For example, the TCP BTL could look like this (only showing "enable" logic for 
simplicity -- adding "disable" logic is an exercise left for the reader):

-----
  if (opal_network_value("eth") || opal_network_value("ethernet")) {
      want_all_ip_interfaces = true;
  } else {
      foreach IP_interface {
          // Search for strings like "eth0" or "10.10.0.0/16"
          if (opal_network_value(ip_interface_name) ||
              opal_network_value(CIDR of ip_interface_name)) {
              push(@desired_interfaces, ip_interface_name);
          }
      }
  }

  foreach IP_interface {
      if (want_all_ip_interfaces || @desired_interfaces contains ip_interface) {
          make a module for that IP interface
      }
  }
-----
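
For the "disable" half that the example above leaves out, one possible shape is a single check that takes the mode into account. This is a sketch only: net_mode_t and network_permitted are hypothetical names, and it assumes that when neither option is given, every network is permitted. Shared memory and process loopback never go through this check, since they are not affected by these CLI options.

```c
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>   /* strcasecmp (POSIX) */

/* Hypothetical: which of the two CLI options (if any) was given. */
typedef enum { NET_MODE_NONE, NET_MODE_ENABLE, NET_MODE_DISABLE } net_mode_t;

/* Case-insensitive search of the user-supplied NETWORK values. */
static bool listed(const char *const *values, size_t count, const char *name)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (strcasecmp(values[i], name) == 0) {
            return true;
        }
    }
    return false;
}

/*
 * --enable:  permitted only if the name was listed.
 * --disable: permitted unless the name was listed.
 * neither:   assumed to permit everything.
 */
static bool network_permitted(net_mode_t mode, const char *const *values,
                              size_t count, const char *name)
{
    switch (mode) {
    case NET_MODE_ENABLE:
        return listed(values, count, name);
    case NET_MODE_DISABLE:
        return !listed(values, count, name);
    default:
        return true;
    }
}
```

A BTL could call this once per candidate name ("eth", "ethernet", the interface name, and so on) and decide per interface, rather than duplicating separate enable and disable branches.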

The usnic BTL would likely be quite similar to the TCP BTL, but also look for 
strings like "usnic_0".

The openib BTL could look like this:

-----
  if (opal_network_value("ib") || opal_network_value("infiniband")) {
      want_all_ib_interfaces = true;
  } else if (opal_network_value("roce")) {
      want_all_roce_interfaces = true;
  } else if (opal_network_value("iwarp")) {
      want_all_iwarp_interfaces = true;
  } else if (opal_network_value("eth") || opal_network_value("ethernet")) {
      want_all_roce_interfaces = true;
      want_all_iwarp_interfaces = true;
  } else {
      foreach verbs_interface {
          // Search for strings like "mlx4_0" or "10.50.0.0/16" for
          // RoCE/iWARP/IB with IPoIB enabled.
          // Could also search for IB subnet IDs, if desired...?
          if (opal_network_value(verbs_interface_name) ||
              opal_network_value(subnet ID of verbs_interface_name) ||
              opal_network_value(IP CIDR of verbs_interface_name)) {
              push(@desired_interfaces, verbs_interface_name);
          }
      }
  }

  foreach verbs_interface {
      make_module = false;
      if (@desired_interfaces contains verbs_interface) {
          make_module = true;
      } else if (verbs_interface is IB && want_all_ib_interfaces) {
          make_module = true;
      } else if (verbs_interface is RoCE && want_all_roce_interfaces) {
          make_module = true;
      } else if (verbs_interface is iWARP && want_all_iwarp_interfaces) {
          make_module = true;
      }
      if (make_module) {
          make a module for that verbs interface
      }
  }
-----

I imagine that the MXM MTL, Yalla PML, and hcoll and FCA colls could be 
similar, but slightly simpler since they (presumably) don't care about iWARP 
interfaces.

PSM / PSM2 / uGNI / Portals / etc. can all do similar things.

The key here is that ALL BTLs, MTLs, OSC, and COLL modules -- anything that 
talks directly to the network -- will need to use this opal_network_value() API.

9. The ":QUALIFIER" value is optional for each NETWORK specified, and can 
be used to disambiguate when a given network type can be reached multiple ways 
in OMPI.  For example, it can help choose between the openib BTL, the MXM MTL, 
and the Yalla PML:

  mpirun --enable ib:btl
  mpirun --enable ib:mtl
  mpirun --enable ib:yalla

That being said, I don't like these names (btl, mtl, yalla) because they mean 
nothing to non-OMPI experts.  But I like the concept that a QUALIFIER can 
(somehow) help choose between the different OMPI code paths.

Here's another example:

  mpirun --enable eth:tcp
  mpirun --enable eth:usnic

These QUALIFIER values are a *little* better, but not much -- the user still 
has to know that they exist to know to choose one of them ("tcp" and "usnic").  
But note that usNIC will someday have tag matching support, so it will be able 
to be used through the OFI MTL, too.  Hence, "eth:usnic" won't be unique...

...thoughts?

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
