On Mar 22, 2011, at 4:03 PM, George Bosilca wrote:

> On Mar 22, 2011, at 14:20, Ralph Castain wrote:
>
>> Hi folks
>>
>> For those interested in trying it, I completed backporting the multicast grpcomm module from my branch over the last weekend. This allows all modex and other ORTE-level collective operations to occur via multicast, which significantly improves the performance of those operations.
>
> Looks promising. Based on my understanding of multicast protocols and their implementations, I wonder how you overcome some of the limitations of UDP multicast.
>
> Since IP multicast is a one-to-many protocol, only broadcast-type collectives can be expressed efficiently. So this covers only half of the modex operations and half of the initial application spawn (not the daemon URI collection). However, this is still better than nothing!
>
> Unfortunately, multicast over UDP inherits one of the defining features of UDP: its unreliability. While packet drops can hardly be triggered in a single-switch configuration, this is not a reliable approach. I noticed you implemented a fixed-size window (based on a circular ring) to increase the reliability of the UDP rmcast. However, what happens when thousands of modex messages collide is not yet clear. Apparently, if the lost message is no longer in the buffer, no drastic action is taken (i.e., the job will just hang). Thus, without a __reliability__ layer built on top, this is not a practice we should encourage in production-quality software.

All true - I consider this to be solely at a development stage. The reliability code in the UDP mcast is adequate for my needs in ORCM, where I have a continuous stream of messages, so the loss of a single message from a source is quickly detected when I get the next one. The ring buffer size can be arbitrarily set by param, but the default is big enough for my purposes. Getting the system to at least error out when a resend falls outside the buffer window is easy to do. I should definitely at least have it call the errmgr when that happens so it can decide the right course of action.
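Roughly what I have in mind, as a sketch only - the ring buffer, sequence numbers, and the error callback below are placeholders I made up for the example, not the actual rmcast or errmgr interfaces:

/* Sketch only: error out when a resend request falls outside the ring
 * buffer window, rather than silently dropping it.  None of these names
 * are the real ORTE rmcast/errmgr symbols. */

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define RING_SIZE 128   /* would come from an MCA param in practice */

typedef struct {
    uint32_t seq;       /* sequence number of the buffered message */
    bool     valid;
    /* ... payload ... */
} ring_entry_t;

typedef struct {
    ring_entry_t slots[RING_SIZE];
    uint32_t     highest_seq;   /* last sequence number sent */
} send_ring_t;

/* placeholder for "call the errmgr so it can decide what to do" */
static void report_unrecoverable_loss(uint32_t seq)
{
    fprintf(stderr, "rmcast: seq %u no longer in resend window\n", seq);
    /* real code would hand this to the errmgr instead of just printing */
}

static void handle_resend_request(send_ring_t *ring, uint32_t seq)
{
    /* the window covers the last RING_SIZE messages we sent */
    if (seq > ring->highest_seq || seq + RING_SIZE <= ring->highest_seq) {
        /* requested message was never sent, or has already been
         * overwritten: don't just ignore it and let the job hang */
        report_unrecoverable_loss(seq);
        return;
    }

    ring_entry_t *entry = &ring->slots[seq % RING_SIZE];
    if (!entry->valid || entry->seq != seq) {
        report_unrecoverable_loss(seq);
        return;
    }

    /* otherwise, re-multicast the buffered message (not shown) */
}

The point is simply that a resend request that can no longer be satisfied surfaces as an error the errmgr can act on, instead of the silent hang George describes.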
> If we assume the context of a LAN, then there are 3 categories: hub-only LAN, switch without IGMP, and switch with IGMP control. The first two are similar: the broadcast goes over all output links (it is a flooding protocol; the message will be dropped at the kernel level if no application is waiting for it). For the last class, the output only goes on the segments where hosts have requested it. Therefore, in order to make sure nobody misses a single multicast, one has to verify that all processes supposed to be involved in the bcast are ready to receive it. While this doesn't sound like a big issue, it implies a many-to-one type of operation in the context of ORTE.

Agreed - which is why the different scope arguments are there. Dealing with these messages efficiently is a good challenge. (A rough receiver-side sketch of the group-join step is further down.)

> Last issue is about the port/address allocation. It appears that the current implementation relies on MCA parameters (base_multicast_ports) to ensure uniqueness of the port/multicast address allocation. Therefore, when two mpirun instances run simultaneously on different machines of the same cluster, the user (or the users) will have to ensure mutual exclusion of the ports.

Absolutely true! For ORCM, the HNP's ess module "announces" itself on multicast, listens for all other "mpirun" equivalents out there, and then sets its channels accordingly. I haven't worried about that for OMPI, but we can address it if there is interest in this mode of operation. And if not - well, it is an interesting experimental capability :-)
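The selection logic amounts to something like the following - an illustrative sketch only, with made-up names (note_announcement, select_free_channel, MAX_CHANNELS), not the ORCM ess code:

/* Sketch of the "listen, then pick a free channel" idea: given the
 * port/address range the base_multicast_ports MCA param controls, and
 * the channels other HNPs have already announced, pick the first unused
 * one.  Purely illustrative. */

#include <stdbool.h>
#include <stdio.h>

#define MAX_CHANNELS 256

/* channels claimed by other mpirun/HNP instances, learned from their
 * announcements on a well-known "directory" channel */
static bool claimed[MAX_CHANNELS];

/* record an announcement heard during the listening period */
static void note_announcement(int channel)
{
    if (channel >= 0 && channel < MAX_CHANNELS) {
        claimed[channel] = true;
    }
}

/* after listening, claim the first free channel; the caller would then
 * announce it so later-starting HNPs avoid it */
static int select_free_channel(void)
{
    for (int ch = 0; ch < MAX_CHANNELS; ch++) {
        if (!claimed[ch]) {
            claimed[ch] = true;
            return ch;
        }
    }
    return -1;   /* no free channel - error out */
}

int main(void)
{
    /* pretend we heard two other HNPs while listening */
    note_announcement(0);
    note_announcement(1);
    printf("using channel offset %d\n", select_free_channel());  /* prints 2 */
    return 0;
}

The sketch ignores the obvious race between two HNPs starting at nearly the same time - resolving that is where the real work lies.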
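And on the earlier point about receivers needing to be in the group before anything is sent: the receiver side looks roughly like this at the socket level, with a trivial "ready" message back to the root so the root can wait for everyone before it starts multicasting. The group address, ports, and message here are invented for the illustration - this is not what the rmcast framework actually does:

/* Illustration only: a receiver joins the multicast group and then tells
 * the root (over plain UDP unicast) that it is ready. */

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/socket.h>
#include <unistd.h>

#define MCAST_GROUP "239.255.0.1"   /* example group, not an ORTE default */
#define MCAST_PORT  5001
#define ROOT_ADDR   "192.168.1.1"   /* example root/HNP address */
#define ROOT_PORT   5002

int main(void)
{
    /* 1. create and bind the socket the multicast traffic will arrive on */
    int sd = socket(AF_INET, SOCK_DGRAM, 0);
    struct sockaddr_in local = { .sin_family = AF_INET,
                                 .sin_addr.s_addr = htonl(INADDR_ANY),
                                 .sin_port = htons(MCAST_PORT) };
    if (bind(sd, (struct sockaddr *)&local, sizeof(local)) < 0) {
        perror("bind"); exit(1);
    }

    /* 2. join the group - with IGMP snooping this is what makes the
     *    switch forward the traffic onto our segment at all */
    struct ip_mreq mreq;
    mreq.imr_multiaddr.s_addr = inet_addr(MCAST_GROUP);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(sd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0) {
        perror("IP_ADD_MEMBERSHIP"); exit(1);
    }

    /* 3. tell the root we are ready to receive (the many-to-one step) */
    struct sockaddr_in root = { .sin_family = AF_INET,
                                .sin_port = htons(ROOT_PORT) };
    inet_pton(AF_INET, ROOT_ADDR, &root.sin_addr);
    sendto(sd, "READY", 5, 0, (struct sockaddr *)&root, sizeof(root));

    /* ... now block in recvfrom() for the actual multicast traffic ... */
    close(sd);
    return 0;
}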
> george.
>
>> In order to use it, you'll need to add --enable-multicast to your configure, and -mca grpcomm mcast to your cmd line. You'll also need a reasonably good udp multicast environment. The new module will work with any launch environment.
>>
>> I'm not really focused on scalability in my branch (mostly on resilience), but I did some quick experiments and found that the new module reduced modex time by quite a bit, depending on system and scale of course.
>>
>> I hope to finish my backport over the next week or so - the last part will enable ALL orte system operations to be done via multicast. This eliminates things like the initial TCP connection flood back to the HNP when the daemons are launched. Again, I don't focus much on scalability, so anyone wanting to test that capability at scale will be welcome. I'll send out another note when it is ready.
>>
>> Ralph
>
> "To preserve the freedom of the human mind then and freedom of the press, every spirit should be ready to devote itself to martyrdom; for as long as we may think as we will, and speak as we think, the condition of man will proceed in improvement."
> -- Thomas Jefferson, 1799