Hate to bring this up again, but I was thinking that an easy way to reduce the size of the modex would be to reduce the length of the names describing each piece of data.

More concretely, for a simple run I get the following names, each of which is sent over the wire for every proc (note that this will change depending on the number of BTLs one has active):
ompi-proc-info
btl.openib.1.3
btl.tcp.1.3
pml.base.1.0
btl.udapl.1.3

So that's 89 bytes of naming overhead per process: 64 bytes of string data plus 25 bytes of dss packing overhead (a 32 bit length and a null terminator for each of the 5 strings).
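
For anyone who wants to check the arithmetic, here is a minimal sketch in C. It assumes the packing described in this thread (a 32 bit length, the characters, and a null terminator per string); the function and array names are made up for illustration and are not part of the dss.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Names from the simple run listed above. */
const char *const modex_names[] = {
    "ompi-proc-info", "btl.openib.1.3", "btl.tcp.1.3",
    "pml.base.1.0", "btl.udapl.1.3"
};
const size_t modex_name_count = sizeof(modex_names) / sizeof(modex_names[0]);

/* Cost of one name: 32 bit length + characters + null terminator.
 * (Hypothetical helper; it only models the layout described above.) */
size_t packed_name_size(const char *name)
{
    assert(name != NULL);
    return sizeof(uint32_t) + strlen(name) + 1;
}

/* Total naming overhead for one process. */
size_t modex_naming_overhead(const char *const names[], size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += packed_name_size(names[i]);
    }
    return total;
}
```

Calling modex_naming_overhead(modex_names, modex_name_count) gives the per-process figure quoted above.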

A couple of possible solutions to this:
1. Send 32 bit string hashes instead of the strings. This would reduce the per process size from 89 to 20 bytes, but there is always a (slight) possibility of collisions.

2. Change the way the dss packs strings. Currently, it packs a 32 bit string length, the string, and a null terminator. It may be good enough to just pack the string and the null terminator. This would reduce the per process size from 89 to 69 bytes.

3. Reduce the length of the names. 'ompi-proc-info' could become simply 'pinf', and two of the separators could be removed in the other names (ex: 'btl.openib.1.3' -> 'btlopenib1.3'). This would change the per process size from 89 to 71 bytes.

4. Do #2 & #3. This would change the per process size from 89 to 51 bytes.
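
The four options can be compared with a short sketch in C. The hash shown is 32 bit FNV-1a, picked only as an example (nothing in this thread settles on a particular hash), and every function name here is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

const char *const long_names[] = {
    "ompi-proc-info", "btl.openib.1.3", "btl.tcp.1.3",
    "pml.base.1.0", "btl.udapl.1.3"
};
/* Shortened as in option 3: 'pinf', and two separators dropped elsewhere. */
const char *const short_names[] = {
    "pinf", "btlopenib1.3", "btltcp1.3", "pmlbase1.0", "btludapl1.3"
};

/* Option 1: 32 bit FNV-1a hash of a name (an illustrative choice only). */
uint32_t name_hash32(const char *name)
{
    uint32_t h = 2166136261u;            /* FNV offset basis */
    assert(name != NULL);
    for (; *name != '\0'; ++name) {
        h ^= (uint8_t)*name;
        h *= 16777619u;                  /* FNV prime */
    }
    return h;
}

/* Option 1: one 32 bit hash per name. */
size_t size_hashed(size_t count)
{
    return count * sizeof(uint32_t);
}

/* Current packing (and option 3 when given the short names):
 * 32 bit length + characters + null terminator. */
size_t size_length_prefixed(const char *const names[], size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += sizeof(uint32_t) + strlen(names[i]) + 1;
    }
    return total;
}

/* Option 2 (and option 4 when given the short names):
 * characters + null terminator only. */
size_t size_null_terminated(const char *const names[], size_t count)
{
    size_t total = 0;
    for (size_t i = 0; i < count; ++i) {
        total += strlen(names[i]) + 1;
    }
    return total;
}
```

With 5 names, these give 20, 69, 71, and 51 bytes for options 1 through 4. A real hashed scheme would still need some way to detect or rule out collisions between component names.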

Anyways, just an idea for consideration.

Tim

WHAT: Changes to MPI layer modex API

WHY: To be mo' betta scalable

WHERE: ompi/mpi/runtime/ompi_module_exchange.* and everywhere that
calls ompi_modex_send() and/or ompi_modex_recv()

TIMEOUT: COB Fri 4 Apr 2008

DESCRIPTION:

Per some of the scalability discussions that have been occurring (some
on-list and some off-list), and per the e-mail I sent out last week
about ongoing work in the openib BTL, Ralph and I put together a loose
proposal this morning to make the modex more scalable. The timeout is
fairly short because Ralph wanted to start implementing in the near
future, and we didn't anticipate that this would be a contentious
proposal.

The theme is to break the modex into two different kinds of data:

- Modex data that is specific to a given proc
- Modex data that is applicable to all procs on a given node

For example, in the openib BTL, the majority of modex data is
applicable to all processes on the same node (GIDs and LIDs and
whatnot). It is much more efficient to send only one copy of such
node-specific data to each process (vs. sending ppn copies to each
process). The spreadsheet I included in last week's e-mail clearly
shows this.
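
As a rough illustration of why this helps (a sketch only; the function names and the sizes in the usage note are hypothetical and are not taken from the spreadsheet), the modex payload delivered to a single process can be modeled like this:

```c
#include <assert.h>
#include <stddef.h>

/* Bytes delivered to ONE process when every peer process publishes its
 * own copy of the node-common data (ppn copies per node). */
size_t modex_bytes_per_proc_copies(size_t nodes, size_t ppn,
                                   size_t node_bytes, size_t proc_bytes)
{
    return nodes * ppn * (node_bytes + proc_bytes);
}

/* Bytes delivered to ONE process when node-common data is sent once per
 * node and only the proc-specific data is sent once per process. */
size_t modex_bytes_node_split(size_t nodes, size_t ppn,
                              size_t node_bytes, size_t proc_bytes)
{
    return nodes * node_bytes + nodes * ppn * proc_bytes;
}

/* Difference between the two schemes: nodes * (ppn - 1) * node_bytes. */
size_t modex_bytes_saved(size_t nodes, size_t ppn,
                         size_t node_bytes, size_t proc_bytes)
{
    assert(ppn >= 1);  /* with no procs per node the model is meaningless */
    return modex_bytes_per_proc_copies(nodes, ppn, node_bytes, proc_bytes)
         - modex_bytes_node_split(nodes, ppn, node_bytes, proc_bytes);
}
```

For example, with made-up values of 100 nodes, 8 ppn, 64 bytes of node-common data, and 16 bytes of proc-specific data, each process would receive 64000 bytes under the current scheme and 19200 bytes under the split scheme.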

1. Add new modex API functions. The exact function signatures are
TBD, but they will be generally of the form:

  * int ompi_modex_proc_send(...): send modex data that is specific to
this process. It is just about exactly the same as the current API
call (ompi_modex_send).

  * int ompi_modex_proc_recv(...): receive modex data from a specified
peer process (indexed on ompi_proc_t*). It is just about exactly the
same as the current API call (ompi_modex_recv).

  * int ompi_modex_node_send(...): send modex data that is relevant
for all processes in this job on this node. It is intended that only
one process in a job on a node will call this function. If more than
one process in a job on a node calls _node_send(), then only one will
"win" (meaning that the data sent by the others will be overwritten).

  * int ompi_modex_node_recv(...): receive modex data that is relevant
for a whole peer node; receive the ["winning"] blob sent by
_node_send() from the source node. We haven't yet decided what the
node index will be; it may be (ompi_proc_t*) (i.e., _node_recv() would
figure out what node the (ompi_proc_t*) resides on and then give you
the data).
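
To make the "only one will win" semantics concrete, here is a toy, single-process model in C. It is not the proposed implementation: the real signatures are TBD, the node index is undecided, and every name below (toy_*) is invented for illustration. Nodes are indexed by a plain int just to keep the sketch short.

```c
#include <stddef.h>
#include <string.h>

#define TOY_MAX_NODES 4
#define TOY_MAX_BLOB  64

/* One slot per node: the most recently sent blob replaces any earlier one. */
typedef struct {
    unsigned char data[TOY_MAX_BLOB];
    size_t size;
    int valid;
} toy_node_slot_t;

static toy_node_slot_t toy_node_table[TOY_MAX_NODES];

/* Publish node-wide data. A later call for the same node overwrites the
 * earlier blob, so exactly one sender "wins". Returns 0 on success. */
int toy_modex_node_send(int node, const void *buf, size_t size)
{
    if (node < 0 || node >= TOY_MAX_NODES || buf == NULL ||
        size > TOY_MAX_BLOB) {
        return -1;
    }
    memcpy(toy_node_table[node].data, buf, size);
    toy_node_table[node].size = size;
    toy_node_table[node].valid = 1;
    return 0;
}

/* Fetch the winning blob for a node. Returns -1 if nothing was sent. */
int toy_modex_node_recv(int node, const void **buf, size_t *size)
{
    if (node < 0 || node >= TOY_MAX_NODES || buf == NULL || size == NULL ||
        !toy_node_table[node].valid) {
        return -1;
    }
    *buf = toy_node_table[node].data;
    *size = toy_node_table[node].size;
    return 0;
}
```

In this model the last sender wins; the proposal above only says that one sender wins, not which one.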

2. Make the existing modex API calls (ompi_modex_send,
ompi_modex_recv) be wrappers around the new "proc" send/receive
calls. This will provide exactly the same functionality as the
current API (but be sub-optimal at scale). It will give BTL authors
(etc.) time to update to the new API, potentially taking advantage of
common data across multiple processes on the same node. We'll likely
put in some opal_output()'s in the wrappers to help identify code that
is still calling the old APIs.
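
The wrapper idea in step 2 might look roughly like the following sketch. The toy_* names and signatures are placeholders rather than the real API, and fprintf() stands in for the opal_output() call mentioned above so that the example compiles on its own.

```c
#include <stddef.h>
#include <stdio.h>

static int toy_legacy_calls = 0;     /* how often the old entry point was used */
static size_t toy_last_size = 0;     /* size of the last blob "sent" */

/* Stand-in for the new proc-specific send; it only records the size. */
int toy_modex_proc_send(const void *buf, size_t size)
{
    if (buf == NULL && size > 0) {
        return -1;
    }
    toy_last_size = size;
    return 0;
}

/* Old entry point kept as a thin wrapper: warn, then forward unchanged. */
int toy_modex_send(const void *buf, size_t size)
{
    ++toy_legacy_calls;
    fprintf(stderr,
            "toy_modex_send is deprecated; use toy_modex_proc_send\n");
    return toy_modex_proc_send(buf, size);
}
```

Callers of the old function keep working and get the same return value, while the warning makes remaining uses easy to spot before the old calls are removed in step 3.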

3. Remove the old API calls (ompi_modex_send, ompi_modex_recv) before
v1.3 is released.


