Thanks again, Jed. This has definitely helped narrow down the possibilities.
Best,

Samar

On Apr 11, 2014, at 8:41 AM, Jed Brown <[email protected]> wrote:

> Samar Khatiwala <[email protected]> writes:
>
>> Hi Jed,
>>
>> Thanks for the quick reply. This is very helpful. You may well be right that
>> my matrices are not large enough (~2.5e6 x 2.5e6 and I'm running on
>> 360 cores = 15 nodes x 24 cores/node on this XC-30) and my runs are
>> therefore sensitive to network latency. Would this, though, impact other
>> people running jobs on nearby nodes? (I suppose it would if I'm passing
>> too many messages because of the small size of the matrices.)
>
> It depends on your partition. The Aries network on XC-30 is a
> high-radix low-diameter network. There should be many routes between
> nodes, but the routing algorithm likely does not know which wires to
> avoid. This leads to performance variation, though I think it should
> tend to be less extreme than when you obtain disconnected partitions on
> Gemini.
>
> The gold standard of reproducible performance is Blue Gene, where the
> network is reconfigured to give you an isolated 5D torus. A Blue Gene
> may or may not be available or cost effective (reproducible performance
> does not imply high performance/efficiency for a given workload).
