On Mar 22, 2011, at 14:20, Ralph Castain wrote:

> Hi folks
> 
> For those interested in trying it, I completed backporting the multicast 
> grpcomm module from my branch over the last weekend. This allows all modex 
> and other ORTE-level collective operations to occur via multicast, which 
> significantly improves the performance of those operations.

Looks promising. Based on my understanding of the multicast protocols and their 
implementations, I wonder how you overcome some of the limitations of UDP 
multicast.

Since IP multicast is a one-to-many protocol, only broadcast-type collectives 
can be expressed efficiently. So this only covers half of the modex operations 
and half of the initial application spawn (not the daemon URI collection). 
However, this is still better than nothing!

Unfortunately, multicast over UDP inherits one of the major features of UDP: 
its unreliability. While packet drops can hardly be triggered on a 
single-switch configuration, this is not a reliable approach. I noticed you 
implemented a fixed-size window (based on a circular ring) to increase the 
reliability of the UDP rmcast. However, it is not yet clear what will happen 
when thousands of modex messages collide. Apparently, if the lost message is 
not found in the buffer, no drastic action is taken (i.e., the job will just 
hang). Thus, without a built-in __reliability__ layer, this is not a practice 
we should encourage in production-quality software.
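
To make the concern concrete, here is a minimal sketch of a fixed-size 
retransmit window built on a circular buffer. It is not the rmcast code; the 
names RetransmitWindow, record, lookup and missing_sequences are hypothetical, 
and raising an error on a miss is my assumption of what a "drastic action" 
could look like instead of a silent hang.

```python
class RetransmitWindow:
    """Fixed-size circular buffer holding the most recently sent messages."""

    def __init__(self, size):
        if size <= 0:
            raise ValueError("size must be positive")
        self.size = size
        self.slots = [None] * size  # each slot holds (seq, payload) or None
        self.next_seq = 0

    def record(self, payload):
        """Store an outgoing message and return its sequence number."""
        seq = self.next_seq
        # Older messages are silently overwritten once the ring wraps around.
        self.slots[seq % self.size] = (seq, payload)
        self.next_seq += 1
        return seq

    def lookup(self, seq):
        """Return the payload for seq, or fail loudly if it was overwritten."""
        entry = self.slots[seq % self.size]
        if entry is None or entry[0] != seq:
            # Hypothetical policy: report the loss rather than wait forever.
            raise LookupError(f"message {seq} is no longer in the window")
        return entry[1]


def missing_sequences(last_delivered, received_seq):
    """Sequence numbers a receiver skipped between two arrivals."""
    return list(range(last_delivered + 1, received_seq))
```

With a window of 4 slots and 6 messages sent, messages 0 and 1 can no longer 
be retransmitted. A burst of thousands of modex messages would push a slow 
receiver into exactly that case unless the window is sized for the burst.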

If we assume the context of a LAN, then there are 3 categories: hub-only LAN, 
switch without IGMP, and switch with IGMP control. The first two are similar: 
the broadcast goes over all output links (it is a flooding protocol: the 
message will be dropped at the kernel level if no application is waiting for 
it). For the last class, the output only goes to the segments where hosts have 
requested it. Therefore, in order to make sure nobody misses a single 
multicast, one has to verify that all processes supposed to be involved in the 
bcast are ready to receive. While this doesn't sound like a big issue, it 
implies a many-to-one type of operation in the context of ORTE.
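
A rough sketch of what this implies, assuming IPv4 and the standard socket 
API: each receiver joins the group (which is what makes an IGMP-aware switch 
forward the traffic), and the sender holds the bcast until every expected 
receiver has reported in. ReadinessGate and its methods are hypothetical names 
for illustration, not ORTE code; the reports it collects are the many-to-one 
step.

```python
import socket
import struct


def make_mreq(group, interface="0.0.0.0"):
    """Build the 8-byte ip_mreq structure: group address + local interface."""
    return struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton(interface))


def join_group(sock, group, interface="0.0.0.0"):
    """Join a multicast group; the kernel sends the IGMP membership report."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    make_mreq(group, interface))


class ReadinessGate:
    """Sender-side bookkeeping: who has confirmed it is ready to receive."""

    def __init__(self, expected):
        self.expected = set(expected)
        self.ready = set()

    def mark_ready(self, rank):
        """Record one 'I have joined and am listening' report (many-to-one)."""
        if rank not in self.expected:
            raise ValueError(f"unexpected participant {rank}")
        self.ready.add(rank)

    def missing(self):
        """Participants that have not reported yet, in sorted order."""
        return sorted(self.expected - self.ready)

    def can_broadcast(self):
        """True only once every expected participant has reported."""
        return self.ready == self.expected
```

The gate itself is trivial; the cost is in delivering N readiness reports to 
one process, which is the pattern multicast was supposed to avoid.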

The last issue is about port/address allocation. It appears that the current 
implementation relies on MCA parameters (base_multicast_ports) to ensure 
uniqueness of port/multicast address allocation. Therefore, when two mpiruns 
run simultaneously on different machines of the same cluster, the user (or the 
users) will have to ensure mutual exclusion of the ports.
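
As an illustration of the check users would end up doing by hand, here is a 
small sketch that flags jobs whose settings clash. The job names, the 
(group, (first_port, last_port)) layout and the rule "same group and 
overlapping ports means conflict" are all assumptions made for the example; 
this does not parse the actual MCA parameter format.

```python
def ranges_overlap(a, b):
    """True if two inclusive (first, last) port ranges share any port."""
    return a[0] <= b[1] and b[0] <= a[1]


def find_conflicts(jobs):
    """Return pairs of job names using the same group with overlapping ports.

    jobs maps a job name to (multicast_group, (first_port, last_port)).
    """
    names = sorted(jobs)
    conflicts = []
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            group_a, ports_a = jobs[first]
            group_b, ports_b = jobs[second]
            if group_a == group_b and ranges_overlap(ports_a, ports_b):
                conflicts.append((first, second))
    return conflicts
```

Nothing in this check can be done by a single mpirun on its own: it needs 
knowledge of every other job on the cluster, which is why the burden falls on 
the users.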

  george.

> In order to use it, you'll need to add --enable-multicast to your configure, 
> and -mca grpcomm mcast to your cmd line. You'll also need a reasonably good 
> udp multicast environment. The new module will work with any launch 
> environment.
> 
> I'm not really focused on scalability in my branch (mostly on resilience), 
> but I did some quick experiments and found that the new module reduced modex 
> time by quite a bit, depending on system and scale of course.
> 
> I hope to finish my backport over the next week or so - the last part will 
> enable ALL orte system operations to be done via multicast. This eliminates 
> things like the initial TCP connection flood back to the HNP when the daemons 
> are launched. Again, I don't focus much on scalability, so anyone wanting to 
> test that capability at scale will be welcome. I'll send out another note 
> when it is ready.
> 
> Ralph
> 
> 
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> http://www.open-mpi.org/mailman/listinfo.cgi/devel

"To preserve the freedom of the human mind then and freedom of the press, every 
spirit should be ready to devote itself to martyrdom; for as long as we may 
think as we will, and speak as we think, the condition of man will proceed in 
improvement."
  -- Thomas Jefferson, 1799

