Samar Khatiwala <[email protected]> writes:

> Hi Jed,
>
> Thanks for the quick reply. This is very helpful. You may well be right that
> my matrices are not large enough (~2.5e6 x 2.5e6 and I'm running on 360
> cores = 15 nodes x 24 cores/node on this XC-30) and my runs are therefore
> sensitive to network latency. Would this, though, impact other people
> running jobs on nearby nodes? (I suppose it would if I'm passing too many
> messages because of the small size of the matrices.)

It depends on your partition. The Aries network on XC-30 is a high-radix low-diameter network. There should be many routes between nodes, but the routing algorithm likely does not know which wires to avoid. This leads to performance variation, though I think it should tend to be less extreme than when you obtain disconnected partitions on Gemini. The gold standard of reproducible performance is Blue Gene, where the network is reconfigured to give you an isolated 5D torus. A Blue Gene may or may not be available or cost effective (reproducible performance does not imply high performance/efficiency for a given workload).
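
For a rough sense of scale, the figures in the quoted message can be turned into rows per process. This is only a back-of-the-envelope sketch: it assumes one MPI rank per core and an even row distribution, neither of which is stated in the thread, and the variable names are made up for illustration.

```python
# Figures taken from the quoted message; the layout is an assumption.
n_rows = 2_500_000       # matrix is roughly 2.5e6 x 2.5e6
nodes = 15
cores_per_node = 24

# Assumption: one MPI rank per core, rows split evenly across ranks.
n_ranks = nodes * cores_per_node
rows_per_rank = n_rows / n_ranks

print(f"{n_ranks} ranks, about {rows_per_rank:.0f} rows per rank")
```

With only a few thousand rows per rank, each process likely has little local work to overlap with communication, which would be consistent with the runs being sensitive to network latency.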
