Re: [OMPI devel] [PATCH] openib btl: extensable cpc selection enablement

Jeff Squyres Thu, 10 Jan 2008 13:28:38 -0500

On Jan 10, 2008, at 11:55 AM, Jon Mason wrote:

BTW, I should point out that the modex CPC string list stuff is
currently somewhat wasteful in the presence of multiple ports on a
host.  This will definitely be bad at scale.  Specifically, we'll send
around a CPC string in the openib modex for *each* port.  This may be
repetitive (and wasteful at scale), especially if you have more than
one port/NIC of the same type in each host.  This can cause the modex
size to increase quite a bit.


While the message sent via modex is now longer, the number of messages
sent is the same.  So I would argue that this is only slightly less
optimal than the current implementation.


Not at scale.

Consider if someone has 2,000 8-core servers, each with a 2-port HCA.
Let's assume a full-machine run of 16,000 MPI processes, each of which
can use 2 ports.  Let's assume non-ConnectX HCAs to be conservative, so
they'll all be able to use the oob CPC (someday soon, RDMA CM and IBCM
will also be available, but let's start small).

Each of the 16k MPI procs will have "oob"+sizeof(uint32_t) twice in
their modex for a grand total of 14 extra bytes.  No big deal on an
individual message, but consider that that's 16,000 * 14 = 224,000
extra bytes being gathered to mpirun.

Then consider that the whole pile of modex data is glommed together
and broadcast to each MPI process.  Hence, we're now sending an extra
16,000 * 14 * 16,000 = 3,584,000,000 bytes across the network during
MPI_INIT (in addition to whatever is already being sent in the
modex).

Ralph's work on the new ORTE branch will help this quite a bit (with
the routed oob stuff -- sending modex messages only once to each node,
vs. once to each process), but still, the numbers are large:


- gather phase: 16,000 * 14 = 224,000 extra bytes
- scatter phase: 16,000 * 14 * 2,000 = 448,000,000 extra bytes

This is much more manageable, but still -- we should be careful when
we can.
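
As a quick sanity check on the arithmetic above, here is a minimal
Python sketch.  The function name and its parameters are hypothetical
and purely illustrative; none of this is Open MPI code.  It assumes
every process contributes the same number of extra bytes and that the
gathered pile is sent exactly once to each receiver (each process
today, each node with the per-host distribution).

```python
def modex_overhead(num_procs, extra_bytes_per_proc, num_receivers):
    """Return (gather_bytes, scatter_bytes) of extra modex traffic.

    gather:  every process sends its extra bytes to mpirun once.
    scatter: the whole gathered pile is sent once to each receiver.
    """
    gather = num_procs * extra_bytes_per_proc
    scatter = gather * num_receivers
    return gather, scatter


# Scenario from this mail: 2,000 nodes * 8 cores = 16,000 processes,
# 14 extra bytes per process ("oob" + a uint32_t, for each of 2 ports).
NUM_NODES = 2_000
NUM_PROCS = NUM_NODES * 8
EXTRA_BYTES = 2 * (len("oob") + 4)

# Non-routed OOB: the pile goes to every process.
flat = modex_overhead(NUM_PROCS, EXTRA_BYTES, NUM_PROCS)
# Per-host distribution: the pile goes to every node.
per_host = modex_overhead(NUM_PROCS, EXTRA_BYTES, NUM_NODES)

print(f"non-routed: gather {flat[0]:,}  scatter {flat[1]:,}")
print(f"per-host:   gather {per_host[0]:,}  scatter {per_host[1]:,}")
```

Changing the second argument is enough to try other per-process
sizes against the same two distribution schemes.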

Switching to hashed names and index lists will save quite a bit.  For
example, if we do a dumb hash of the cpc name down to 1 byte and send
index lists of which ports use each cpc (each index can be 1 byte --
leading to a max of 256 ports in each host, which is probably
sufficient for the foreseeable future!), we're down to 3 extra bytes
in the modex, which is much more manageable:


in today's non-routed OOB:
- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 16,000 = 768,000,000 extra bytes

in the soon-to-be per-host modex distribution:
- gather phase: 16,000 * 3 = 48,000 extra bytes
- scatter phase: 16,000 * 3 * 2,000 = 96,000,000 extra bytes
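
To make the hashed-name-plus-index-list idea concrete, here is a rough
Python sketch of the two layouts.  Everything in it is an assumption
for illustration: the function names are made up, the "dumb hash" is
just the sum of the name's bytes modulo 256, and the hashed layout
carries only the hash byte and the port indexes (no count byte or
other framing, and no per-port uint32_t), which is what it takes to
match the 3-byte figure above.  The real openib modex format may well
differ.

```python
import struct


def cpc_hash(name):
    """Dumb 1-byte hash: sum of the name's bytes, modulo 256."""
    return sum(name.encode("ascii")) % 256


def encode_per_port(port_cpcs):
    """String-per-port layout: CPC name + uint32_t for every port."""
    out = bytearray()
    for cpcs in port_cpcs:
        for name, value in cpcs:
            out += name.encode("ascii") + struct.pack("!I", value)
    return bytes(out)


def encode_hashed(port_cpcs):
    """Hashed layout: per CPC, 1 hash byte then 1-byte port indexes."""
    ports_by_cpc = {}
    for index, cpcs in enumerate(port_cpcs):
        if index > 255:
            raise ValueError("a 1-byte index allows at most 256 ports")
        for name, _value in cpcs:
            ports_by_cpc.setdefault(name, []).append(index)
    out = bytearray()
    for name, indexes in ports_by_cpc.items():
        out.append(cpc_hash(name))
        out += bytes(indexes)
    return bytes(out)


# One 2-port HCA where both ports can use the oob CPC.
two_ports = [[("oob", 0)], [("oob", 0)]]
print(len(encode_per_port(two_ports)), "bytes with a string per port")
print(len(encode_hashed(two_ports)), "bytes with hash + index list")
```

A 1-byte hash can obviously collide once there are more CPC names, so
a real implementation would need to check for that (or just assign
fixed IDs); the sketch ignores it.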

Additionally, the routed oob makes the reality even better than that,
because it uses a tree distribution for the modex.  So although the
raw number of bytes is the same as a per-host-but-not-routed modex
distribution, the distribution is quite wide, potentially avoiding
network congestion (because different ports/links/servers are
involved, all in parallel).


--
Jeff Squyres
Cisco Systems
