On Jan 10, 2008, at 11:55 AM, Jon Mason wrote:

>> BTW, I should point out that the modex CPC string list stuff is
>> currently somewhat wasteful in the presence of multiple ports on a
>> host. This will definitely be bad at scale. Specifically, we'll send
>> around a CPC string in the openib modex for *each* port.  This may be
>> repetitive (and wasteful at scale), especially if you have more than
>> one port/NIC of the same type in each host.  This can cause the modex
>> size to increase quite a bit.
>
> While the message sent via modex is now longer, the number of messages
> sent is the same.  So I would argue that this is only slightly less
> optimal than the current implementation.

Not at scale.

Consider if someone has 2,000 8-core servers, each with a 2-port HCA. Let's assume a full-machine run of 16,000 MPI processes, each of which can use 2 ports. Let's assume non-ConnectX HCAs to be conservative, so they'll all be able to use the oob CPC (someday soon, RDMA CM and IBCM will also be available, but let's start small).

Each of the 16k MPI procs will have "oob"+sizeof(uint32_t) twice in their modex for a grand total of 14 extra bytes. No big deal on an individual message, but consider that that's 16,000 * 14 = 224,000 extra bytes being gathered to mpirun.

Then consider that the whole pile of modex data is glommed together and broadcast to each MPI process. Hence, we're now sending an extra 16,000 * 14 * 16,000 = 3,584,000,000 bytes across the network during MPI_INIT (in addition to whatever is already being sent in the modex).

Ralph's work on the new ORTE branch will help this quite a bit (with the routed oob stuff -- sending modex messages only once to each node, vs. once to each process), but still, the numbers are large:

- gather phase: 16,000 * 14 = 224,000 extra bytes
- scatter phase: 16,000 * 14 * 2,000 = 448,000,000 extra bytes

This is much more manageable, but still -- we should be careful when we can.
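
To make the arithmetic easy to re-run with other cluster sizes, here is a rough Python sketch of the same calculation. The function names are made up for illustration (they are not anything in the Open MPI code base), and it assumes the per-port cost is just the CPC name characters plus one uint32_t, as described above.

```python
import struct


def cpc_bytes_per_proc(cpc_name: str, ports: int) -> int:
    """Extra modex bytes per process: (name chars + one uint32_t) per port."""
    per_port = len(cpc_name.encode("ascii")) + struct.calcsize("<I")
    return per_port * ports


def modex_overhead(num_procs: int, bytes_per_proc: int, num_receivers: int):
    """Return (gather_bytes, scatter_bytes).

    num_receivers is the number of destinations that each get a full copy
    of the gathered data: every process today, or every node with the
    per-host distribution.
    """
    gather = num_procs * bytes_per_proc
    scatter = gather * num_receivers
    return gather, scatter


per_proc = cpc_bytes_per_proc("oob", ports=2)      # 2 * (3 + 4) = 14
print(modex_overhead(16_000, per_proc, 16_000))    # (224000, 3584000000)
print(modex_overhead(16_000, per_proc, 2_000))     # (224000, 448000000)
```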

Switching to hashed names and index lists will save quite a bit. For example, if we do a dumb hash of the cpc name down to 1 byte and send index lists of which ports use each cpc (each index can be 1 byte -- leading to a max of 256 ports in each host, which is probably sufficient for the foreseeable future!), we're down to 3 extra bytes in the modex, which is much more manageable:

in today's non-routed OOB:
- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 16,000 = 768,000,000 extra bytes

in the soon-to-be per-host modex distribution:
- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 2,000 = 96,000,000 extra bytes
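
As a rough illustration of the hashed-name-plus-index-list idea, here is a minimal Python sketch. Everything in it is an assumption for the sake of the example: the hash (sum of the name's bytes mod 256), the function names, and the "rdmacm" / "ibcm" strings used in the lookup table. It also only packs a single CPC entry; carrying several CPCs in one modex message would need some framing (a count or a delimiter) that isn't shown here and would cost a little extra.

```python
def hash_cpc_name(name: str) -> int:
    """A deliberately dumb 1-byte hash: sum of the name's bytes mod 256."""
    return sum(name.encode("ascii")) % 256


def encode_cpc_entry(name: str, port_indexes: list) -> bytes:
    """Pack one CPC as [1-byte name hash][1-byte port index, ...]."""
    if any(not 0 <= p <= 255 for p in port_indexes):
        raise ValueError("port indexes must fit in one byte (0..255)")
    return bytes([hash_cpc_name(name)]) + bytes(port_indexes)


def decode_cpc_entry(blob: bytes, known_names: list):
    """Map the hash back to a name that the receiver already knows."""
    by_hash = {hash_cpc_name(n): n for n in known_names}
    return by_hash[blob[0]], list(blob[1:])


entry = encode_cpc_entry("oob", [0, 1])
print(len(entry))                                        # 3
print(decode_cpc_entry(entry, ["oob", "rdmacm", "ibcm"]))  # ('oob', [0, 1])
```

A real implementation would have to make sure the 1-byte hashes of the available CPC names don't collide; the three names used here happen not to with this particular hash.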

Additionally, the routed oob makes the reality even better than that, because it uses a tree distribution for the modex. So although the raw number of bytes is the same as a per-host-but-not-routed modex distribution, the distribution is quite wide, potentially avoiding network congestion (because different ports/links/servers are involved, all in parallel).
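
Here is a back-of-the-envelope Python sketch of that point. It is a simplified model, not a description of how the routed oob is implemented: it assumes every node receives the full modex blob exactly once, and the fanout of 2 is an arbitrary value picked for the example. The function names are made up.

```python
def flat_distribution(num_nodes: int, payload: int):
    """One root sends the payload directly to every node.

    Returns (total_bytes, max_bytes_sent_by_any_single_sender).
    """
    total = num_nodes * payload
    return total, total


def tree_distribution(num_nodes: int, payload: int, fanout: int):
    """Each node receives once and forwards to at most `fanout` children.

    Returns (total_bytes, max_bytes_sent_by_any_single_sender).
    """
    total = num_nodes * payload
    max_per_sender = min(fanout, num_nodes) * payload
    return total, max_per_sender


payload = 16_000 * 3  # the gathered 3-byte-per-process data from above
print(flat_distribution(2_000, payload))            # (96000000, 96000000)
print(tree_distribution(2_000, payload, fanout=2))  # (96000000, 96000)
```

The totals match, but in the tree model no single sender pushes more than fanout copies, which is where the reduced congestion comes from.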

--
Jeff Squyres
Cisco Systems
