Hey Sasha, I was going to submit this after I had a chance to test on one of our big clusters to see if it worked 100% right. But my final testing has been delayed (for a month now!). Ira said some folks from Sonoma were interested in this, so I'll go ahead and post it.
This is a patch for something I call "port_offsetting" (the name/description of the option is open to suggestion).

Basically, we want to move to using LMC > 0 on our clusters because some of the newer MPI implementations take advantage of multiple LIDs and have shown faster performance when LMC > 0. The problem is that users who do not use the newer MPI implementations, or do not run their code in a way that can take advantage of multiple LIDs, suffer a large performance degradation. We determined that the primary issue is what we started calling "base lid alignment".

Here's a simple example. Assume LMC = 2 and we are trying to route the LIDs of 4 ports (A, B, C, D). Those LIDs are:

  port A - 1, 2, 3, 4
  port B - 5, 6, 7, 8
  port C - 9, 10, 11, 12
  port D - 13, 14, 15, 16

Suppose forwarding of these LIDs goes through 4 switch ports. If we cycle through the ports like updn/minhop currently do, we would see something like this:

  switch port 1: 1, 5, 9, 13
  switch port 2: 2, 6, 10, 14
  switch port 3: 3, 7, 11, 15
  switch port 4: 4, 8, 12, 16

Note that the base LID of each port (LIDs 1, 5, 9, 13) goes through only 1 port of the switch. Thus a user that uses only the base LID is using only 1 of the 4 switch ports they could be using, leading to terrible performance. We want to get this instead:

  switch port 1: 1, 8, 11, 14
  switch port 2: 2, 5, 12, 15
  switch port 3: 3, 6, 9, 16
  switch port 4: 4, 7, 10, 13

where the base LIDs are distributed more evenly. In order to do this, we (effectively) iterate through all the ports like before, but we start the iteration at a different index depending on the number of paths we have routed thus far (a rough sketch of the idea is appended below my signature).

On one of our clusters, testing has shown that when we run with LMC=1 and 1 task per node, mpibench (AlltoAll tests) ranges from 10-30% worse than when LMC=0 is used. With LMC=2, mpibench tends to be 50-70% worse than with LMC=0. With the port offsetting option, the degradation ranges from only 1-5% worse than LMC=0. I am currently at a loss why I cannot get it to be even with LMC=0, but 1-5% is small enough to not make users mad :-)

The part I haven't been able to test yet is whether newer MPIs that do take advantage of LMC > 0 perform equally well with my port_offsetting turned off and on.

Thanks, look forward to your comments,

Al

--
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
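P.S. To make the offsetting idea concrete, here is a minimal standalone sketch (this is NOT the patch itself; the constants and the route() helper are invented purely for illustration). The real code derives the offset from the number of paths routed so far; the sketch simply rotates the starting switch port once per end port, which is enough to reproduce the two tables above.

/*
 * Rough standalone illustration of the "port offsetting" idea
 * (not the actual OpenSM patch).  The 2^LMC lids of each end port
 * are assigned to the candidate switch ports round-robin, but each
 * end port's cycle starts at a rotated index so the base lids spread
 * across all switch ports instead of piling up on switch port 1.
 */
#include <stdio.h>

#define LMC             2
#define LIDS_PER_PORT   (1 << LMC)      /* 4 lids per end port     */
#define NUM_END_PORTS   4               /* ports A, B, C, D        */
#define NUM_SW_PORTS    4               /* candidate switch ports  */

static void route(int use_offset)
{
	/* port_lids[p] collects the lids forwarded through switch port p */
	int port_lids[NUM_SW_PORTS][NUM_END_PORTS * LIDS_PER_PORT];
	int count[NUM_SW_PORTS] = { 0 };
	int ports_routed = 0;
	int lid = 1;
	int ep, i, p;

	for (ep = 0; ep < NUM_END_PORTS; ep++) {
		/* Without offsetting every end port starts its cycle at
		 * switch port 0, so every base lid lands on port 0.  With
		 * offsetting we rotate the starting index for each end
		 * port we have routed so far. */
		int start = use_offset ? ports_routed % NUM_SW_PORTS : 0;

		for (i = 0; i < LIDS_PER_PORT; i++, lid++) {
			p = (start + i) % NUM_SW_PORTS;
			port_lids[p][count[p]++] = lid;
		}
		ports_routed++;
	}

	printf("%s offsetting:\n", use_offset ? "with" : "without");
	for (p = 0; p < NUM_SW_PORTS; p++) {
		printf("  switch port %d:", p + 1);
		for (i = 0; i < count[p]; i++)
			printf(" %d", port_lids[p][i]);
		printf("\n");
	}
}

int main(void)
{
	route(0);
	route(1);
	return 0;
}

Running it should print the "before" and "after" lid assignments shown in the tables above.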
