I gotcha.  

I chose the CG class A results specifically because they showed poor
performance on our machine.  (I think that's because CG passes lots of
small messages, but I'm not too certain...)  I'm looking for "tricks"
to get a cluster to run a little bit faster.  I did another run of the
CG benchmarks recently on a 2.0.36 kernel with that TCP_NODELAY patch
compiled in.  It seems to help.  I only did the test runs on eight
processors, but it looks like this...

 Name Class   NC      Time      Mop/s Mop/s/proc Version Filename
 CG   A        8     12.97     115.41      14.43     2.3 cg.A.8.nova2036
 CG   A        8     16.66      89.83      11.23     2.3 cg.A.8.nstar-2.0-test

As opposed to what we had before...
 CG   A        8     34.67      43.16       5.39     2.3 cg.A.8.egcs3
 CG   A        8     35.78      41.82       5.23     2.3 cg.A.8.egcs3-f
 CG   A        8     39.55      37.84       4.73     2.3 cg.A.8.egcs3-star

I don't know yet if this scales correctly, but the point here is to
understand how the machine performs under a heavy network load.
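For reference, TCP_NODELAY is just a per-socket option that disables Nagle's
algorithm (the coalescing of small writes into larger segments), which is why
it matters for codes that pass lots of small messages.  The patch presumably
forces it on kernel-wide, but any program that owns the socket can set it
itself.  A minimal sketch (Python just for brevity):

```python
import socket

# Create a TCP socket and disable Nagle's algorithm, so small writes
# are sent immediately instead of being coalesced while waiting for ACKs.
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)

# Read the option back to confirm it took effect.
nodelay = s.getsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY)
print("TCP_NODELAY set:", nodelay != 0)
s.close()
```

The tradeoff is more, smaller packets on the wire, which is exactly what you
want for latency-sensitive MPI traffic and not what you want for bulk
throughput.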

Our CG class C benchmarks show some drop-off too, though it's not quite as
bad...

 Name Class   NC      Time      Mop/s Mop/s/proc Version Filename
 CG   C        4   3685.10      38.90       9.72     2.3 cg.C.4.egcs3
 CG   C        8   1665.59      86.06      10.76     2.3 cg.C.8.egcs3
 CG   C       16   1125.52     127.36       7.96     2.3 cg.C.16.egcs3
 CG   C       16   1186.77     120.79       7.55     2.3 cg.C.16.egcs3
 CG   C       32    722.80     198.32       6.20     2.3 cg.C.32.egcs3
 CG   C       64    669.49     214.11       3.35     2.3 cg.C.64.egcs3

I still wouldn't call that linear scaling.  Our LU and BT results
were a bit nicer...
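To put numbers on "not linear": here's a quick sketch computing parallel
efficiency from the Mop/s column of the CG class C table above.  Since no
single-processor class C time was listed, the 4-way run is the baseline
(that's just my choice of baseline, nothing from the benchmark itself; for
16 processors I took the first of the two rows).

```python
# CG class C results from the table above: processor count -> total Mop/s.
runs = {4: 38.90, 8: 86.06, 16: 127.36, 32: 198.32, 64: 214.11}

efficiency = {}
for nc, mops in runs.items():
    speedup = mops / runs[4]      # measured speedup over the 4-way run
    ideal = nc / 4                # what perfectly linear scaling would give
    efficiency[nc] = speedup / ideal
    print(f"{nc:3d} procs: speedup {speedup:5.2f}x vs ideal {ideal:5.1f}x "
          f"-> {efficiency[nc]:.0%} efficient")
```

By this measure the 64-way run is only about a third as efficient as the
4-way run, which is the drop-off I mean.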

 Name Class   NC      Time      Mop/s Mop/s/proc Version Filename
 LU   C        2  22827.14      89.32      44.66     2.3 lu.C.2.pgcc
 LU   C        4  11456.69     177.97      44.49     2.3 lu.C.4.pgcc
 LU   C        8   5849.49     348.58      43.57     2.3 lu.C.8.pgcc
 LU   C       32   1671.70    1219.72      38.12     2.3 lu.C.32.pgcc
 LU   C       64   1095.83    1860.70      29.07     2.3 lu.C.64.pgcc

 Name Class   NC      Time      Mop/s Mop/s/proc Version Filename
 BT   C        9   8577.89     334.15      37.13     2.3 bt.C.9.pgcc
 BT   C       25   3447.56     831.39      33.26     2.3 bt.C.25.pgcc
 BT   C       36   2524.95    1135.18      31.53     2.3 bt.C.36.pgcc
 BT   C       49   2691.03    1065.12      21.74     2.3 bt.C.49.pgcc
 BT   C       64   2258.70    1268.99      19.83     2.3 bt.C.64.pgcc
 BT   C       81   1917.52    1494.78      18.45     2.3 bt.C.81.pgcc
 BT   C      100   1592.76    1799.57      18.00     2.3 bt.C.100.pgcc
 BT   C      121   2100.84    1364.35      11.28     2.3 bt.C.121.pgcc
 BT   C      121   2135.58    1342.16      11.09     2.3 bt.C.121.pgcc

I'm going to rerun them soon if I get a chance; these runs were done
about a year ago.  I just have to convince everyone to let me have the
cluster for a little while.

> It's important to read the documentation for the NAS benchmarks
> before using them.  Pierre's only half right - communication here
> IS the problem, but because of the latency, not the throughput.

Can you explain a little more what you mean by that?

> The NAS benchmarks have three different class sizes of problems,
> A, B, and C (well, there is a fourth, W, but ignore that for now ;)
> A is the smallest benchmark size - i.e., the smallest amount of
> work to be done.  Divide that by 64 CPUs and you have a little
> computation and a LOT of communication.  On the other hand, C
> is the _largest_ benchmark size; divide that by 64 CPUs and you
> have a LOT of computation relative to the amount of communication
> that you do, so here your speedups should be good.
> 
> My suggestion would be that you try a class B or class C size
> problem before reporting speedup results and complaining that
> the speedups are small for a 64 node cluster.

What do you think now?

I'm out to understand why our cluster performs the way it does and to
do whatever I can to speed it up, but I'm really not sure where to start.
I mean, if you want to see where the cluster performs poorly, you choose
small runs on large numbers of processors, and then you do those runs
because whatever changes you make will be most apparent there, right?
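That intuition can be made concrete with a toy model (made-up constants, not
fitted to any of our numbers): total time is the fixed amount of work divided
across p processors, plus a communication term that grows with p.  Shrink the
work and the communication term takes over:

```python
def model_time(p, work=1.0, comm=0.01):
    """Toy fixed-problem-size timing model (illustrative constants only)."""
    # work / p: the computation shrinks as processors are added;
    # comm * (p - 1): a made-up overhead term that grows with processor count.
    return work / p + comm * (p - 1)

for work, label in ((1.0, "small problem"), (100.0, "large problem")):
    for p in (8, 64):
        speedup = model_time(1, work) / model_time(p, work)
        print(f"{label}, {p:2d} procs: modeled speedup {speedup:5.1f}")
```

With these invented constants the small problem tops out around a speedup of
1.5 on 64 processors while the large one still gets around 46, which is
roughly the class A versus class C pattern Bob describes below.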

It feels good to look at some tremendous problem on 100 processors and say
"wow! speedup of 99!"  But how much fun is that?  It just tells you that
you did a good job.  Finding out where things could be made faster is the
hard part from the standpoint of building the cluster.  I've got something
like 100 machines now, and I'm really out to start tuning the TCP/IP stack,
the eepro100 driver, and MPICH to get us running a little bit faster.

A good chunk of the "optimizations" we've been doing has involved
rearranging bits and pieces of the network, mucking around with compiler
flags, and swapping kernel versions around.

> Granted, there is a limit to how much speedup can be obtained
> (search for "Amdahl's Law"...), even with excellent hardware
> (and I'm talking better hardware than Myrinet here, folks), but
> for C sizes CG should at least scale nearly linearly for 
> the node ranges you're talking about.
> 
> -Bob
> 
>   ----- Forwarded message from Pierre Brua -----
> 
>   Eric Roman wrote:
>   > But when
>   > we started doing 64 processor runs we got some horrible performance.
>   > Here's an example:
>   > 
>   >  Name Class   NC      Time      Mop/s Mop/s/proc Version Filename
>   >  CG   A        1     65.91      22.71      22.71     2.3 cg.A.1.egcs3-t3
>   ...
>   >  CG   A       64     28.72      52.11       0.81     2.3 cg.A.64.egcs3
>   > 
>   > Speedup of 2 for 64 processors?  Does this make any sense whatsoever?
> 
>         Those benchmarks are used to test the network speed of
>   multi-million-dollar parallel computers, which usually have specially
>   tuned network cards and protocols.  A 100Mb Ethernet + TCP/IP network
>   is awful for that, because it was _not_ designed for it: if an Ethernet
>   packet collides with another one on the wire, the Ethernet protocol
>   automatically delays new packets for a randomly calculated amount of
>   time, and Ethernet performance drops to less than 20% of the maximum
>   throughput if you overload it.

-- 
Eric Roman  <[EMAIL PROTECTED]>   Department of Applied Mathematics
(516)632-8545                         SUNY/Stony Brook
-
Linux SMP list: FIRST see FAQ at http://www.irisa.fr/prive/mentre/smp-faq/
To Unsubscribe: send "unsubscribe linux-smp" to [EMAIL PROTECTED]
