Samar Khatiwala <[email protected]> writes:

> Hi Jed,
>
> Thanks for the quick reply. This is very helpful. You may well be right
> that my matrices are not large enough (~2.5e6 x 2.5e6 and I'm running
> on 360 cores = 15 nodes x 24 cores/node on this XC-30) and my runs are
> therefore sensitive to network latency. Would this, though, impact
> other people running jobs on nearby nodes? (I suppose it would if I'm
> passing too many messages because of the small size of the matrices.)

It depends on your partition.  The Aries network on XC-30 is a
high-radix low-diameter network.  There should be many routes between
nodes, but the routing algorithm likely does not know which wires to
avoid.  This leads to performance variation, though I think it should
tend to be less extreme than when you obtain disconnected partitions on
Gemini.

The gold standard of reproducible performance is Blue Gene, where the
network is reconfigured to give you an isolated 5D torus.  A Blue Gene
may or may not be available or cost-effective (reproducible performance
does not imply high performance/efficiency for a given workload).
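
For a rough sense of scale, the sizes quoted above can be turned into a
per-core row count.  This is only a sketch: it assumes one MPI rank per
core and an even row-wise split, neither of which is stated in this
thread, and it treats the approximate ~2.5e6 as exactly 2,500,000.  The
name rows_per_rank is made up for illustration.

```python
def rows_per_rank(n_rows, n_ranks):
    """Split n_rows as evenly as possible across n_ranks (hypothetical helper)."""
    base, extra = divmod(n_rows, n_ranks)
    # The first `extra` ranks each take one additional row.
    return [base + 1 if r < extra else base for r in range(n_ranks)]

n_rows = 2_500_000                 # "~2.5e6" from the quoted message, taken as exact
nodes, cores_per_node = 15, 24     # figures from the quoted message
n_ranks = nodes * cores_per_node   # assumes one rank per core

local = rows_per_rank(n_rows, n_ranks)
print(n_ranks, min(local), max(local))
```

Under those assumptions each rank owns about 7000 rows, which is the
regime where per-message latency can matter relative to local work.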
