Michael,
        After going through this for a few weeks, I found that a couple
of things matter most.  Firstly, get the best optimised BLAS libraries,
either Intel's MKL 6.1 (not any earlier) or libgoto
(http://www.cs.utexas.edu/users/kgoto).  This can roughly double your
performance.

        Secondly, you need to partition the job up right, although for 16
processors there isn't much choice of Ps and Qs.  Experience is that
the best performance comes from either a square partitioning
(i.e. Ps = Qs) or a slightly rectangular one (i.e. Ps < Qs), but
remember that Ps*Qs is the number of processors available.  Also, an NB
of around 208 or 224 worked well for me with libgoto (although other
reports are of 108 being good).
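
        To see how few choices there are, here is a minimal Python sketch
that lists every Ps x Qs grid with Ps <= Qs for a given processor count.
The function name process_grids is made up for illustration; it is not
part of HPL.

```python
def process_grids(nprocs):
    """Return all (P, Q) pairs with P * Q == nprocs and P <= Q, squarest first."""
    grids = []
    p = 1
    while p * p <= nprocs:
        if nprocs % p == 0:
            grids.append((p, nprocs // p))
        p += 1
    # The list was built with P ascending; the squarest grid has the
    # largest P, so reverse it.
    return grids[::-1]


print(process_grids(16))   # the 16-processor case discussed above
print(process_grids(304))  # 152 dual-CPU nodes
```

For 16 processors this leaves only 4x4, 2x8 and 1x16, so 4x4 and 2x8 are
the obvious ones to try.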

        Other parameters don't make a large difference, but you do need to set
them right.  I've attached the output from my best run to give you a
reasonable start.

        Finally, you will need a large problem size.  On my cluster,
176000 seems to be good for 2GB RAM per node.  It will occupy all of
RAM, with no swapping (in fact, you can probably go up a bit, but by
180000 you will probably start swapping).
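
        As a rough guide to picking N for a different cluster, the sketch
below estimates memory use from the size of the matrix alone: N*N
double-precision values at 8 bytes each, spread evenly over the nodes.
The 80% fill fraction and the helper names are assumptions for
illustration, not HPL settings; the operating system and MPI buffers
need the remainder, so treat the result as a starting point.

```python
import math

GIB = 2**30  # bytes in one GiB


def matrix_bytes(n):
    """Bytes needed for an n x n matrix of 8-byte doubles."""
    return 8 * n * n


def max_problem_size(nodes, bytes_per_node, fraction=0.8, nb=1):
    """Largest N (rounded down to a multiple of nb) whose matrix fits in
    `fraction` of the total memory. `fraction` is a rule of thumb."""
    usable = nodes * bytes_per_node * fraction
    n = math.isqrt(int(usable // 8))
    return n - n % nb


# Matrix share per node for N = 176000 on 152 nodes.
per_node = matrix_bytes(176000) / 152
print(f"{per_node / GIB:.2f} GiB per node for the matrix alone")

# Starting guess for 8 nodes with 1 GiB each and NB = 200.
print(max_problem_size(nodes=8, bytes_per_node=GIB, fraction=0.8, nb=200))
```

By this estimate the matrix for N = 176000 takes roughly three quarters
of each 2GB node, with the rest going to everything else on the machine.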

        Now, in reality, you are pushing it to get better than 60% of maximum
theoretical performance.  Also, a lot of it is fake, as two runs in a row
do not always get exactly the same results, since small variations in
the whole system can make a noticeable difference.  With the same
parameters as attached I had results between 1075 Gflops and 1095 Gflops.
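
        To put those numbers against the theoretical maximum, here is a
small sketch for the cluster described in this message (152 nodes, two
3.06GHz CPUs each).  It assumes a peak of 2 double-precision floating
point operations per cycle per CPU; that figure is an assumption about
these Xeons, not something taken from the benchmark output, and the
function name is made up.

```python
def peak_gflops(cpus, ghz, flops_per_cycle):
    """Theoretical peak in Gflops: CPUs x clock (GHz) x flops per cycle."""
    return cpus * ghz * flops_per_cycle


# flops_per_cycle=2 is an assumed value for these CPUs.
peak = peak_gflops(cpus=152 * 2, ghz=3.06, flops_per_cycle=2)

for measured in (1075.0, 1095.0):
    share = 100.0 * measured / peak
    print(f"{measured:.0f} Gflops is {share:.1f}% of {peak:.0f} Gflops peak")

# Run-to-run spread between the worst and best result.
spread = 100.0 * (1095.0 - 1075.0) / 1095.0
print(f"spread between runs: {spread:.1f}%")
```

Under that assumption both results land just under 60% of peak, and the
gap between them is about 2%.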

        Just for interest, this result is for 152 Dell PowerEdge 1750s,
each with dual 3.06GHz Xeons and 2GB RAM.  All this is connected by a
high-performance Gigabit switch (Foundry FI1500).  (Oh, and the network
performance does make a big difference.)  After testing we reached 107
on the October 2003 Top 500 list (and would have slotted in about 47 on
the previous one :-( - we are still trying to work out who bought the 60
1Tflop+ machines!!!).

Frank

On Thu, 2004-02-12 at 03:18, Michael Edwards wrote:
> I am trying to use the High Performance Linpack (hpl) benchmarking suite suggested 
> at the top500.org site to get an estimate of how my cluster stacks up.  I am using 
> Athlon 2600 CPUs so I would expect to see something like a Gigaflop per cpu.  I 
> have 8 dual processor nodes (I generally exclude the head node from runs unless I 
> forget) so I would expect something like 16 Gflops.  The highest benchmark I have 
> ever been able to squeeze out of the thing is 8 Gflops and that was with the problem 
> size set so high it would occasionally overflow my available memory and crash.
> 
> Any thoughts on how I can improve or are these results reasonable?  I am more 
> interested in general cluster improvement than tuning the linpack program itself.  I 
> heard the 2.6 kernel was better for SMP performance for example, but I wasn't sure 
> how big a difference it would make in the benchmarking.  I haven't gotten my hands 
> dirty with kernel stuff yet, and I am reluctant to mess with it unless it is likely 
> to make a large difference.
> 
> The cluster does fine with my problem, I am tweaking it mostly for ego value and 
> educational value.  :)
> 
> If this is outside the topic of this list, I would be interested in knowing what 
> forum would be more appropriate for general clustering questions like this.
> 
> Possibly pertinent info:
> nodes use an integrated NIC (probably generic ethernet)
> 3Com SuperStackII Hub (10/100 TX I think)
> Head node is a PIII 450 (which is why I leave it out of calcs)
> using "mpirun n1-8 -np 16 xhpl" to run the program
> Nodes have 1G Ram and dual Athlon 2600 CPUs
> Oscar 3.0 on RH9
> Kernel: 2.4.20-8smp
> 
> 
> 
> _______________________________________________
> Oscar-users mailing list
> [EMAIL PROTECTED]
> https://lists.sourceforge.net/lists/listinfo/oscar-users
-- 
ac3
Suite G16, Bay 7, Locomotive Workshop   Phone:  02 9209 4600
Australian Technology Park              Fax:    02 9209 4611
Eveleigh   NSW   1430

============================================================================
HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27, 2000
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   40000   175000   176000   177000   178000   179000   180000   181000 
          182000 
NB     :     200 
P      :       8 
Q      :      38 
PFACT  :   Right 
NBMIN  :       8 
NDIV   :       2 
RFACT  :    Left 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be          1.110223e-16
- Computational tests pass if scaled residuals are less than           16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8        40000   200     8    38              78.36          5.445e+02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0158003 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0144471 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0028370 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       175000   200     8    38            3331.72          1.072e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0673868 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0089810 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0015430 ...... PASSED

============================================================================
HPLinpack 1.0  --  High-Performance Linpack benchmark  --  September 27, 2000
Written by A. Petitet and R. Clint Whaley,  Innovative Computing Labs.,  UTK
============================================================================

An explanation of the input/output parameters follows:
T/V    : Wall time / encoded variant.
N      : The order of the coefficient matrix A.
NB     : The partitioning blocking factor.
P      : The number of process rows.
Q      : The number of process columns.
Time   : Time in seconds to solve the linear system.
Gflops : Rate of execution for solving the linear system.

The following parameter values will be used:

N      :   40000   175000   176000   177000   178000   179000   180000   181000 
          182000 
NB     :     200 
P      :       8 
Q      :      38 
PFACT  :   Right 
NBMIN  :       8 
NDIV   :       2 
RFACT  :    Left 
BCAST  :  1ringM 
DEPTH  :       1 
SWAP   : Mix (threshold = 64)
L1     : transposed form
U      : transposed form
EQUIL  : yes
ALIGN  : 8 double precision words

----------------------------------------------------------------------------

- The matrix A is randomly generated for each test.
- The following scaled residual checks will be computed:
   1) ||Ax-b||_oo / ( eps * ||A||_1  * N        )
   2) ||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  )
   3) ||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo )
- The relative machine precision (eps) is taken to be          1.110223e-16
- Computational tests pass if scaled residuals are less than           16.0

============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8        40000   200     8    38              77.31          5.519e+02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0175013 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0160024 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0031424 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       175000   200     8    38            3311.22          1.079e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0627572 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0083640 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0014369 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       176000   200     8    38            3360.40          1.082e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0094875 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0077927 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0013420 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       177000   200     8    38            3418.90          1.081e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0043984 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0085455 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0015046 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       178000   200     8    38            3460.28          1.087e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0382668 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0082226 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0014446 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       179000   200     8    38            3490.59          1.095e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0162521 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0086945 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0014488 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       180000   200     8    38            3569.05          1.089e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0039023 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0081021 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0013326 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       181000   200     8    38            3665.82          1.078e+03
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0046288 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0083695 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0014308 ...... PASSED
============================================================================
T/V                N    NB     P     Q               Time             Gflops
----------------------------------------------------------------------------
W11L2R8       182000   200     8    38            4589.01          8.758e+02
----------------------------------------------------------------------------
||Ax-b||_oo / ( eps * ||A||_1  * N        ) =        0.0094461 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_1  * ||x||_1  ) =        0.0089341 ...... PASSED
||Ax-b||_oo / ( eps * ||A||_oo * ||x||_oo ) =        0.0015777 ...... PASSED
============================================================================

Finished      9 tests with the following results:
              9 tests completed and passed residual checks,
              0 tests completed and failed residual checks,
              0 tests skipped because of illegal input values.
----------------------------------------------------------------------------

End of Tests.
============================================================================
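
As a sanity check on the Gflops column, the figures can be recomputed
from N and Time.  The sketch below assumes the operation count usually
quoted for HPL, (2/3)*N^3 + (3/2)*N^2; the three sample rows are copied
from the results above, and the function name is made up.

```python
def hpl_gflops(n, seconds):
    """Gflops for an order-n solve, assuming (2/3)n^3 + (3/2)n^2 operations."""
    flops = (2.0 / 3.0) * n**3 + 1.5 * n**2
    return flops / seconds / 1e9


# (N, Time in seconds, reported Gflops) taken from the output above.
runs = [
    (176000, 3360.40, 1082.0),
    (179000, 3490.59, 1095.0),
    (182000, 4589.01, 875.8),
]

for n, seconds, reported in runs:
    print(f"N={n}: computed {hpl_gflops(n, seconds):.1f}, reported {reported:.1f}")
```

The N = 182000 row shows the drop that goes with running out of memory:
the time jumps by far more than the small increase in N would explain.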
