Oops, I forgot about one other important measurement we did. The
following are the average Send/Receive MPI bandwidths as measured by
mpigraph (http://sourceforge.net/projects/mpigraph), again using updn
routing.

Without port offsetting:

  LMC=0: Send 391 MB/s  Recv 461 MB/s
  LMC=1: Send 292 MB/s  Recv 358 MB/s
  LMC=2: Send 197 MB/s  Recv 241 MB/s

With my port offsetting turned on, I got:

  LMC=1: Send 387 MB/s  Recv 457 MB/s
  LMC=2: Send 383 MB/s  Recv 455 MB/s

So, similar to the AlltoAll MPI tests, port offsetting gets the numbers
back to about what they were at LMC=0.

Al
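For background on the "multiple lids" point in the quoted mail below:
with LMC set to L, an end port is assigned 2^L consecutive lids, and a
sender can aim each connection at a different one of them to pull
traffic over different paths through the fabric. Here is a minimal
illustrative sketch of that idea in C; it is not code from mvapich or
OpenMPI, and path_dlid() is a hypothetical helper:

#include <stdint.h>
#include <stdio.h>

/* Illustrative only -- not from mvapich or OpenMPI.  With LMC = lmc, a
 * port answers to lids base_lid .. base_lid + 2^lmc - 1, so an
 * LMC-aware MPI can hand connection i its own destination lid. */
static uint16_t path_dlid(uint16_t base_lid, unsigned lmc, unsigned i)
{
    return (uint16_t)(base_lid + (i & ((1u << lmc) - 1u)));
}

int main(void)
{
    /* port B from the example in the quoted mail: base lid 5, LMC=2 */
    for (unsigned i = 0; i < 4; i++)
        printf("connection %u -> dlid %u\n", i, path_dlid(5, 2, i));
    return 0;
}

With base lid 5 and LMC = 2 this yields dlids 5, 6, 7, 8 -- the four
lids of port B in the example further down.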
On Wed, 2008-05-28 at 17:14 -0700, Al Chu wrote:
> Hey Sasha,
>
> Attached are some numbers from a recent run I did with my port
> offsetting patches. I ran with mvapich 0.9.9 and OpenMPI 1.2.6 on 120
> nodes, with either 1 task per node or 8 tasks per node (nodes have 8
> processors each), trying LMC=0, LMC=1, and LMC=2 with the original
> 'updn', then LMC=1 and LMC=2 with my port-offsetting patch (labeled
> "PO"). Next to these columns is the percentage degradation relative
> to LMC=0. My understanding is that mvapich 0.9.9 does not know how to
> take advantage of multiple lids, while OpenMPI 1.2.6 does.
>
> I think the key numbers to notice are that without port-offsetting,
> performance relative to LMC=0 is pretty bad when the MPI
> implementation does not know how to take advantage of multiple lids
> (mvapich 0.9.9). LMC=1 shows ~30% performance degradation and LMC=2
> shows ~90% degradation on this cluster. With port-offsetting turned
> on, the degradation falls to 0-6%, occasionally even running faster.
> We consider this within "noise" levels.
>
> For MPIs that do know how to take advantage of multiple lids, the
> port-offsetting patch doesn't seem to affect performance much (see
> the OpenMPI 1.2.6 sections).
>
> Please let me know what you think. Thanks.
>
> Al
>
> On Thu, 2008-04-10 at 14:10 -0700, Al Chu wrote:
> > Hey Sasha,
> >
> > I was going to submit this after I had a chance to test on one of
> > our big clusters to see if it worked 100% right, but my final
> > testing has been delayed (for a month now!). Ira said some folks
> > from Sonoma were interested in this, so I'll go ahead and post it.
> >
> > This is a patch for something I call "port_offsetting" (the name
> > and description of the option are open to suggestion). Basically,
> > we want to move to using lmc > 0 on our clusters because some of
> > the newer MPI implementations take advantage of multiple lids and
> > have shown faster performance when lmc > 0.
> >
> > The problem is that users who do not use the newer MPI
> > implementations, or do not run their code in a way that can take
> > advantage of multiple lids, suffer great performance degradation.
> > We determined that the primary issue is what we started calling
> > "base lid alignment". Here's a simple example.
> >
> > Assume LMC = 2 and we are trying to route the lids of 4 ports
> > (A, B, C, D). Those lids are:
> >
> > port A - 1,2,3,4
> > port B - 5,6,7,8
> > port C - 9,10,11,12
> > port D - 13,14,15,16
> >
> > Suppose forwarding of these lids goes through 4 switch ports. If
> > we cycle through the ports like updn/minhop currently do, we would
> > see something like this:
> >
> > switch port 1: 1, 5, 9, 13
> > switch port 2: 2, 6, 10, 14
> > switch port 3: 3, 7, 11, 15
> > switch port 4: 4, 8, 12, 16
> >
> > Note that the base lid of each port (lids 1, 5, 9, 13) goes through
> > only 1 port of the switch. Thus a user who uses only the base lid
> > is using only 1 of the 4 ports they could be using, leading to
> > terrible performance.
> >
> > We want to get this instead:
> >
> > switch port 1: 1, 8, 11, 14
> > switch port 2: 2, 5, 12, 15
> > switch port 3: 3, 6, 9, 16
> > switch port 4: 4, 7, 10, 13
> >
> > where the base lids are distributed more evenly. In order to do
> > this, we (effectively) iterate through all the ports like before,
> > but we start at a different index depending on the number of paths
> > we have routed thus far.
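To make the two orders concrete, here is a standalone C sketch. It is
not the OpenSM patch itself, and the offset rule it uses (advance the
cycle's starting index by one for each end port routed) is just one
reading of the description above that happens to reproduce both tables:

#include <stdio.h>

#define LMC_PATHS    4   /* 2^LMC lids per end port (LMC = 2) */
#define END_PORTS    4   /* ports A, B, C, D -> lids 1..16 */
#define SWITCH_PORTS 4

/* Illustrative only -- not the actual port_offsetting patch. */
int main(void)
{
    unsigned offset = 0;  /* advanced as paths are routed */

    for (unsigned p = 0; p < END_PORTS; p++) {
        for (unsigned i = 0; i < LMC_PATHS; i++) {
            unsigned lid = p * LMC_PATHS + i + 1;
            /* current updn/minhop: every end port starts the cycle at
             * index 0, so every base lid (i == 0) lands on switch
             * port 1 */
            unsigned naive = i % SWITCH_PORTS;
            /* port offsetting: start the cycle at a shifted index so
             * the base lids spread across the switch ports */
            unsigned shifted = (i + offset) % SWITCH_PORTS;
            printf("lid %2u: updn -> switch port %u, offsetting -> %u\n",
                   lid, naive + 1, shifted + 1);
        }
        offset++;  /* next end port starts one position later */
    }
    return 0;
}

In the first column, base lids 1, 5, 9, 13 all map to switch port 1, as
in the first table above; with the offset they land on ports 1, 2, 3,
and 4, as in the second.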
> > On one of our clusters, some testing has shown that when we run
> > with LMC=1 and 1 task per node, mpibench (AlltoAll tests) ranges
> > from 10-30% worse than when LMC=0 is used. With LMC=2, mpibench
> > tends to be 50-70% worse than with LMC=0.
> >
> > With the port offsetting option, the performance degradation ranges
> > from 1-5% relative to LMC=0. I am currently at a loss as to why I
> > cannot get it even with LMC=0, but 1-5% is small enough to not make
> > users mad :-)
> >
> > The part I haven't been able to test yet is whether newer MPIs that
> > do take advantage of LMC > 0 run equally well with my
> > port_offsetting turned off and on.
> >
> > Thanks, look forward to your comments,
> >
> > Al

--
Albert Chu
[EMAIL PROTECTED]
925-422-5311
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory
