Michael,
Having worked through this for a few weeks, I have found a couple of
important things. Firstly, get the best optimised libraries, either
Intel's MKL 6.1 (not any earlier version) or libgoto
(http://www.cs.utexas.edu/users/kgoto). In my experience this can
double your performance.
Secondly, you need to partition the job correctly, although for 16
processors there isn't much choice of Ps and Qs. In my experience the
best performance comes from either a square grid (Ps = Qs) or a
slightly rectangular one (Ps < Qs); remember that Ps*Qs must equal the
number of processors available. Also, an NB of around 208 or 224 worked
well for me with libgoto (although others report 108 being good).
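
As a rough illustration of how little choice there is, the candidate
grids can simply be enumerated. This is a minimal sketch and not part
of HPL; the function name process_grids is made up for the example. It
lists every (Ps, Qs) pair with Ps*Qs equal to the processor count and
Ps <= Qs, squarest first.

```python
def process_grids(nprocs):
    """Return all (P, Q) with P * Q == nprocs and P <= Q, squarest first."""
    grids = []
    p = 1
    while p * p <= nprocs:
        if nprocs % p == 0:
            grids.append((p, nprocs // p))
        p += 1
    # Sort by how far the grid is from square (Q - P).
    return sorted(grids, key=lambda g: g[1] - g[0])


print(process_grids(16))   # [(4, 4), (2, 8), (1, 16)]
print(process_grids(304))  # includes (8, 38), the grid in the attached run
```

Which of the candidates is actually fastest still has to be measured on
the cluster in question.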
Other parameters don't make a large difference, but you do need to set
them right. I've attached the output from my best run to give you a
reasonable start.
Finally, you will need a large problem size. On my cluster 176000 seems
to be good for 2GB RAM per node: it occupies nearly all of RAM with no
swapping (in fact, you can probably go up a bit, but by 180000 you will
probably start swapping).
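
A starting value for N can be estimated from the total memory, since
the N x N matrix of 8-byte doubles dominates the memory use. The sketch
below is only an estimate under stated assumptions: the 75% fill
fraction is a guess chosen to leave room for the OS and MPI buffers,
the function name suggest_n is hypothetical, and rounding N down to a
multiple of NB is a convenience rather than a requirement.

```python
import math


def suggest_n(nodes, bytes_per_node, nb, fraction=0.75):
    """Largest multiple of nb whose N x N double matrix fits in
    `fraction` of the total RAM (fraction is an assumed rule of thumb)."""
    total_bytes = nodes * bytes_per_node
    n = int(math.sqrt(fraction * total_bytes / 8))  # 8 bytes per double
    return (n // nb) * nb


GIB = 2**30
print(suggest_n(152, 2 * GIB, 200))  # 152 nodes with 2GB each
print(suggest_n(8, 1 * GIB, 200))    # 8 nodes with 1GB each
```

Treat the result as a first guess and then step N up or down while
watching for swapping.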
Now, in reality, you are pushing it to get better than 60% of maximum
theoretical performance. Also, the exact figure is somewhat arbitrary,
as two runs in a row do not always give exactly the same results, since
small variations in the whole system can make a noticeable difference.
With the attached parameters I had results between 1075 Gflops and
1095 Gflops.
Just for interest, this result is for 152 Dell PowerEdge 1750s, each
with dual 3.06GHz Xeons and 2GB RAM, all connected by a
high-performance Gigabit switch (Foundry FI1500). (Oh, and the network
performance does make a big difference.) After testing we reached 107
on the October 2003 Top 500 list (and would have slotted in at about 47
on the previous one :-( - we are still trying to work out who bought
the 60 1Tflop+ machines!!!).
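
The "60% of theoretical" figure can be checked with simple arithmetic.
This is a sketch under one loudly stated assumption: that each Xeon
retires 2 double-precision flops per clock cycle. The node count, clock
speed and measured Gflops come from this message; the flops-per-cycle
value does not.

```python
def peak_gflops(cpus, ghz, flops_per_cycle=2):
    """Theoretical peak in Gflops; flops_per_cycle=2 is an assumption."""
    return cpus * ghz * flops_per_cycle


peak = peak_gflops(cpus=152 * 2, ghz=3.06)
for measured in (1075.0, 1095.0):
    pct = 100 * measured / peak
    print(f"{measured:.0f} Gflops is {pct:.1f}% of {peak:.2f} Gflops peak")
```

The same function can be used for any other cluster once the clock
speed and the flops per cycle of its CPUs are known.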
Frank
On Thu, 2004-02-12 at 03:18, Michael Edwards wrote:
> I am trying to use the High Performance Linpack (hpl) benchmarking suite suggested
> at the top500.org site to get an estimate of how my cluster stacks up. I am using
> Athlon 2600 CPUs so I would expect to see something like a Gigaflop per cpu. I
> have 8 dual processor nodes (I generally exclude the head node from runs unless I
> forget) so I would expect something like 16 Gflops. The highest benchmark I have
> ever been able to squeeze out of the thing is 8 Gflops and that was with the problem
> size set so high it would occasionally overflow my available memory and crash.
>
> Any thoughts on how I can improve or are these results reasonable? I am more
> interested in general cluster improvement than tuning the linpack program itself. I
> heard the 2.6 kernel was better for SMP performance for example, but I wasn't sure
> how big a difference it would make in the benchmarking. I haven't gotten my hands
> dirty with kernel stuff yet, and I am reluctant to mess with it unless it is likely
> to make a large difference.
> Cluster does fine with my problem, I am tweaking it mostly for ego value and
> educational value. :)
>
> If this is outside the topic of this list, I would be interested in knowing what
> forum would be more appropriate for general clustering questions like this.
>
> Possibly pertinent info:
> nodes use an integrated NIC (probably generic ethernet)
> 3Com SuperStackII Hub (10/100 TX I think)
> Head node is a PIII 450 (which is why I leave it out of calcs)
> using "mpirun n1-8 -np 16 xhpl" to run the program
> Nodes have 1G RAM and dual Athlon 2600 CPUs
> Oscar 3.0 on RH9
> Kernel: 2.4.20-8smp
--
ac3
Suite G16, Bay 7, Locomotive Workshop Phone: 02 9209 4600
Australian Technology Park Fax: 02 9209 4611
Eveleigh NSW 1430
============================================================================
HPLinpack 1.0 -- High-Performance Linpack benchmark -- September 27, 2000
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================
An explanation of the input/output parameters follows:
T/V : Wall time / encoded variant.
N : The order of the coefficient matrix A.
NB : The partitioning blocking factor.
P : The number of process rows.
Q : The number of process columns.
Time : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.
The following parameter values will be used:
N : 40000 175000 176000 177000 178000 179000 180000 181000
182000
NB : 200
P : 8
Q : 38
PFACT : Right
NBMIN : 8
NDIV : 2
RFACT : Left
BCAST : 1ringM
DEPTH : 1
SWAP : Mix (threshold = 64)
L1 : transposed form
U : transposed form
EQUIL : yes
ALIGN : 8 double precision words
----------------------------------------------------------------------------
- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
1) ||Ax-b||_oo / ( eps * ||A||_1 * N )
2) ||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 )
3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be 1.110223e-16
- Computational tests pass if scaled residuals are less than 16.0
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8        40000   200     8    38              78.36          5.445e+02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0158003 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0144471 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0028370 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       175000   200     8    38            3331.72          1.072e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0673868 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0089810 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0015430 ...... PASSED
============================================================================
HPLinpack 1.0 -- High-Performance Linpack benchmark -- September 27, 2000
Written by A. Petitet and R. Clint Whaley, Innovative Computing Labs., UTK
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8        40000   200     8    38              77.31          5.519e+02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0175013 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0160024 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0031424 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       175000   200     8    38            3311.22          1.079e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0627572 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0083640 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0014369 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       176000   200     8    38            3360.40          1.082e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0094875 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0077927 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0013420 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       177000   200     8    38            3418.90          1.081e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0043984 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0085455 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0015046 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       178000   200     8    38            3460.28          1.087e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0382668 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0082226 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0014446 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       179000   200     8    38            3490.59          1.095e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0162521 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0086945 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0014488 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       180000   200     8    38            3569.05          1.089e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0039023 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0081021 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0013326 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       181000   200     8    38            3665.82          1.078e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0046288 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0083695 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0014308 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       182000   200     8    38            4589.01          8.758e+02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1 * N ) = 0.0094461 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1 * ||x||_1 ) = 0.0089341 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) = 0.0015777 ...... PASSED
============================================================================
Finished 9 tests with the following results:
9 tests completed and passed residual checks,
0 tests completed and failed residual checks,
0 tests skipped because of illegal input values.
----------------------------------------------------------------------------
End of Tests.
============================================================================
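
To compare runs without reading the log by eye, the result rows can be
pulled out with a short script. This is a minimal sketch, not an HPL
tool: the regular expression is an assumption based on the row layout
shown in this output, and the operation count 2/3*N^3 + 3/2*N^2 used to
recompute Gflops is the usual LU estimate, which matches the figures
printed here but is assumed rather than taken from the HPL source. The
sample rows are copied from the log above.

```python
import re

# One result row: variant, N, NB, P, Q, time, Gflops (assumed layout).
ROW = re.compile(
    r"^W\w+\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s+([\d.]+)\s+([\d.eE+-]+)$"
)


def parse_hpl(text):
    """Return a list of dicts, one per HPL result row found in text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line.strip())
        if m:
            n, nb, p, q = (int(m.group(i)) for i in range(1, 5))
            rows.append({"N": n, "NB": nb, "P": p, "Q": q,
                         "time": float(m.group(5)),
                         "gflops": float(m.group(6))})
    return rows


def lu_gflops(n, seconds):
    """Gflops from the assumed LU operation count 2/3*N^3 + 3/2*N^2."""
    return (2.0 / 3.0 * n**3 + 1.5 * n**2) / seconds / 1e9


sample = """\
W11L2R8 175000 200 8 38 3311.22 1.079e+03
W11L2R8 179000 200 8 38 3490.59 1.095e+03
W11L2R8 182000 200 8 38 4589.01 8.758e+02
"""
rows = parse_hpl(sample)
best = max(rows, key=lambda r: r["gflops"])
print("best N:", best["N"], "at", best["gflops"], "Gflops")
for r in rows:
    print(r["N"], round(lu_gflops(r["N"], r["time"]), 1), r["gflops"])
```

The sharp drop at N = 182000 in the log is the kind of thing this makes
easy to spot when many problem sizes are tested in one run.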