Guillaume Michal wrote:
> Hi all,
> We set up our first cluster in our faculty this week. As we are new to cluster computing, there is a lot to learn. We performed some Linpack tests using the OpenMPI benchmark available in the Rocks 4.3 distribution. The system is as follows:
>  - Gigabit Ethernet with an HP Procurve 2800 series switch
>  - 1 master node: 500GB SATA HDD, two Intel quad cores @ 2.33GHz, 2GB mem
>  - 4 compute nodes, each having: 80GB SATA HDD, two Intel quad cores @ 2.33GHz, 8GB mem
>
> First I'm a bit confused by the parameters P and Q in HPL.dat and how to use them properly. I noticed a 4P 2Q test is not equivalent to a 2P 4Q one; generally speaking it does not commute. Why? What exactly are P and Q then: is P the number of processors per node and Q the number of nodes?
>

Visualize the problem as a big 2D matrix.  P and Q describe how that matrix
is divided among the processes: they form a P x Q process grid.  In general,
the best performance is when the matrix is divided into even squares.  If
your core count isn't a perfect square, then P and Q have to be different.
From experience, P should always be less than Q.  There may be a
computational reason for that (i.e., longer strides in memory), but I am
not sure.
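
As an illustration (not from the original posts, just a sketch in Python;
pq_grids is a made-up helper name): you can enumerate the possible P x Q
grids for a given process count and prefer the near-square one with P <= Q.

# Hedged sketch: list candidate HPL process grids for a given core count.
# pq_grids() is a made-up helper, not part of HPL or its tools.
def pq_grids(nprocs):
    """Return all (P, Q) pairs with P*Q == nprocs and P <= Q."""
    return [(p, nprocs // p) for p in range(1, int(nprocs**0.5) + 1)
            if nprocs % p == 0]

# For the 32 cores in the original post:
for p, q in pq_grids(32):
    print("P=%2d  Q=%2d" % (p, q))    # (4, 8) is the most "square" choice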


> Secondly, what is the definition of a processor for a quad core architecture? I suppose a quad core should be counted as 4 processors.

Yes, unless you are using a multithreaded BLAS library.  If you are,
you should run one process per node.

>
> I launched Linpack using Ns=10000 and various configurations for P and Q. So far I got a maximum of 78 Gflops using P=8 Q=4 -> 32 processors.

You want to use as much of the available memory as possible.  I use N=10000
for a single-processor, single-core run with 1GB.  You can figure out a good
value of N with the following formula:

Ns=sqrt(<Memory in Bytes per core>*<Number of cores>/8)

The 8 represents the size of a double.  For <Memory in Bytes per core>, I try
to use the largest number possible, typically about 90% of max.  You never
want to go into swap during these calculations (or have the run crash because
you have diskless nodes and no swap at all).

Ex: If you have 2GB per core for 32 processes, you should use Ns as:

Ns=sqrt(1900*1024*1024*32/8)
Ns=89270
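
Here is a small sketch of that arithmetic in Python (my own, not from the
HPL distribution; the 0.90 safety factor, the hpl_ns name, and the optional
rounding to a multiple of the block size NB are assumptions on my part):

# Hedged sketch of the Ns formula above.
import math

def hpl_ns(mem_bytes_per_core, ncores, frac=0.90, nb=None):
    """Largest N whose N*N matrix of doubles fits in frac of total memory."""
    ns = int(math.sqrt(frac * mem_bytes_per_core * ncores / 8))
    if nb:                      # optionally round down to a multiple of NB
        ns -= ns % nb
    return ns

# The 2GB-per-core, 32-process example above:
print(hpl_ns(2 * 1024**3, 32))           # 87925, close to the Ns=89270 above
# This cluster: 4 nodes x 8GB, 32 cores -> 1GB per core (NB=192 is made up):
print(hpl_ns(1 * 1024**3, 32, nb=192))   # 62016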

Honestly, this may be overkill.  At some point the working set will be
large enough that FP performance is the bottleneck.  I would start with
smaller numbers (say half) and work your way up to understand what is
going on.  In any case, Ns=10000 is way too small.

>
> If I'm right the peak performance should be Rpeak = 4 cores x 4 floating point ops per cycle x 2.33 GHz x 8 quad-core CPUs = 298 Gflops.
> Which would lead to a test running at ~25% of Rpeak.
>
> This is very low and I see 3 possible causes for the problem:
>     - I miscalculated Rpeak
>     - P and Q are not set properly
>     - there is a serious bottleneck
>

I think your Rpeak calculation is correct (I am not sure how many floating
point ops per cycle the latest Intel chips can do).
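
For what it's worth, a quick sanity check of those numbers (my own
arithmetic, not from the original mail; the 4 flops/cycle figure is taken
from the post, not verified against the actual CPU model):

# Sanity check of the quoted Rpeak and efficiency figures.
cores_per_cpu = 4
flops_per_cycle = 4         # assumed, as in the original post
ghz = 2.33
cpus = 8                    # 4 nodes x 2 sockets

rpeak = cores_per_cpu * flops_per_cycle * ghz * cpus   # GFLOPS
print("Rpeak      = %.1f GFLOPS" % rpeak)              # 298.2
print("Efficiency = %.1f%%" % (100 * 78 / rpeak))      # ~26% at 78 GFLOPS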

If increasing Ns doesn't help, run smaller cases on a per-node basis (using
all available memory on each node).  If you don't get the same answer on
every node (or at least within 2%), you have a problem.  Figure out what is
wrong with the slow nodes.  Also, run the test multiple times on the same
node and verify consistent performance.
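
A trivial sketch of that per-node comparison (the node names and numbers
below are made up, not real results):

# Flag nodes more than 2% below the best single-node HPL result.
per_node_gflops = {"compute-0-0": 19.4, "compute-0-1": 19.2,
                   "compute-0-2": 18.9, "compute-0-3": 16.1}

best = max(per_node_gflops.values())
for node, gf in sorted(per_node_gflops.items()):
    flag = "  <-- investigate" if gf < 0.98 * best else ""
    print("%s: %5.1f GFLOPS%s" % (node, gf, flag))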

Craig




> Thanks for your advice
>
> Guillaume
>
>


--
Craig Tierney ([EMAIL PROTECTED])
