I gotcha. I chose the cg class A results specifically because they showed poor performance on our machine. (I think that's because CG passes lots of small messages, but I'm not too certain...) I'm looking for "tricks" to get a cluster to run a little bit faster. I did another run of the cg benchmarks recently on a 2.0.36 kernel w/ that TCP_NODELAY patch compiled in. It seems to help. I only did the test runs on eight processors, but it looks like this...

Name Class NC   Time     Mop/s   Mop/s/proc  Version  Filename
CG   A     8    12.97    115.41  14.43       2.3      cg.A.8.nova2036
CG   A     8    16.66     89.83  11.23       2.3      cg.A.8.nstar-2.0-test

As opposed to what we had before...

Name Class NC   Time     Mop/s   Mop/s/proc  Version  Filename
CG   A     8    34.67     43.16   5.39       2.3      cg.A.8.egcs3
CG   A     8    35.78     41.82   5.23       2.3      cg.A.8.egcs3-f
CG   A     8    39.55     37.84   4.73       2.3      cg.A.8.egcs3-star

I don't know yet if this scales correctly, but the point here is to understand how the machine performs under a heavy network load. Our cg class C benchmarks show some drop-off, though it's not quite as bad...

Name Class NC   Time      Mop/s   Mop/s/proc  Version  Filename
CG   C     4    3685.10    38.90   9.72       2.3      cg.C.4.egcs3
CG   C     8    1665.59    86.06  10.76       2.3      cg.C.8.egcs3
CG   C     16   1125.52   127.36   7.96       2.3      cg.C.16.egcs3
CG   C     16   1186.77   120.79   7.55       2.3      cg.C.16.egcs3
CG   C     32    722.80   198.32   6.20       2.3      cg.C.32.egcs3
CG   C     64    669.49   214.11   3.35       2.3      cg.C.64.egcs3

I still wouldn't call that linear scaling. Our lu and bt results were a bit nicer...
Name Class NC   Time       Mop/s    Mop/s/proc  Version  Filename
LU   C     2    22827.14     89.32  44.66       2.3      lu.C.2.pgcc
LU   C     4    11456.69    177.97  44.49       2.3      lu.C.4.pgcc
LU   C     8     5849.49    348.58  43.57       2.3      lu.C.8.pgcc
LU   C     32    1671.70   1219.72  38.12       2.3      lu.C.32.pgcc
LU   C     64    1095.83   1860.70  29.07       2.3      lu.C.64.pgcc

Name Class NC   Time      Mop/s    Mop/s/proc  Version  Filename
BT   C     9    8577.89    334.15  37.13       2.3      bt.C.9.pgcc
BT   C     25   3447.56    831.39  33.26       2.3      bt.C.25.pgcc
BT   C     36   2524.95   1135.18  31.53       2.3      bt.C.36.pgcc
BT   C     49   2691.03   1065.12  21.74       2.3      bt.C.49.pgcc
BT   C     64   2258.70   1268.99  19.83       2.3      bt.C.64.pgcc
BT   C     81   1917.52   1494.78  18.45       2.3      bt.C.81.pgcc
BT   C     100  1592.76   1799.57  18.00       2.3      bt.C.100.pgcc
BT   C     121  2100.84   1364.35  11.28       2.3      bt.C.121.pgcc
BT   C     121  2135.58   1342.16  11.09       2.3      bt.C.121.pgcc

I'm going to rerun 'em soon if I get a chance; these runs were done about a year ago. I just have to convince everyone to let me use the cluster for a little while.

> It's important to read the documentation for the NAS benchmarks
> before using them. Pierre's only half right - communication here
> IS the problem, but because of the latency, not the throughput.

Can you explain a little more what you mean by that?

> The NAS benchmarks have three different class sizes of problems,
> A, B, and C (well, there is a fourth, W, but ignore that for now ;)
> A is the smallest benchmark size - i.e., the smallest amount of
> work to be done. Divide that by 64 CPUs and you have a little
> computation and a LOT of communication. On the other hand, C
> is the _largest_ benchmark size; divide that by 64 CPUs and you
> have a LOT of computation relative to the amount of communication
> that you do, so here your speedups should be good.
>
> My suggestion would be that you try a class B or class C size
> problem before reporting speedup results and complaining that
> the speedups are small for a 64 node cluster.

What do you think now? I'm out to understand why our cluster performs the way it does and to do whatever I can to speed it up.
I'm really not sure where to start. I mean, if you want to see where the cluster performs poorly, you choose small runs on large numbers of processors, and then you do those runs, because whatever changes you make will be most apparent there, right? It feels good to look at some tremendous problem on 100 processors and say "wow! speedup of 99!" But how much fun is that? It just tells you that you did a good job. Let's find out where things could be made faster; that's the hard part from the standpoint of building the cluster. I've got something like 100 machines now. I'm really out to start tuning the TCP/IP stack, the eepro100 driver, and MPICH to get us running a little bit faster. A good chunk of the "optimizations" we've been doing involves rearranging bits and pieces of the network, mucking around with compiler flags, and swapping kernel versions around.

> Granted, there is a limit to how much speedup can be obtained
> (search for "Amdahl's Law"...), even with excellent hardware
> (and I'm talking better hardware than Myrinet here, folks), but
> for C sizes CG should at least scale nearly linearly for
> the node ranges you're talking about.
>
> -Bob
>
> ----- Forwarded message from Pierre Brua -----
>
> [Charset iso-8859-1 unsupported, filtering to ASCII...]
> Eric Roman wrote:
> > But when
> > we started doing 64 processor runs we got some horrible performance.
> > Here's an example:
> >
> > Name Class NC   Time   Mop/s  Mop/s/proc  Version  Filename
> > CG   A     1    65.91  22.71  22.71       2.3      cg.A.1.egcs3-t3
> ...
> > CG   A     64   28.72  52.11   0.81       2.3      cg.A.64.egcs3
> >
> > Speedup of 2 for 64 processors? Does this make any sense whatsoever?
>
> Those benchmarks are used to test the network speed of
> million-dollar parallel computers that usually have specially
> tuned network cards and protocols.
> And a 100Mb Ethernet + TCP/IP network is awful for that, because
> it has _not_ been designed for it: if an Ethernet packet meets
> another one on the wire, the Ethernet protocol automatically delays
> new packets for a randomly calculated amount of time. And Ethernet
> performance drops to less than 20% of the maximum throughput if
> you overload it.

--
Eric Roman <[EMAIL PROTECTED]>
Department of Applied Mathematics     (516)632-8545
SUNY/Stony Brook

-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/mentre/smp-faq/
To Unsubscribe: send "unsubscribe linux-smp" to [EMAIL PROTECTED]
