I am with Joe on looking at the interrupts. However, could this be a difference in power management with the Red Hat kernel? I.e., when running on 8 cores, are you tripping over some thermal threshold and causing a throttle back to a lower C-state?
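One rough way to check the throttling theory is to compare the per-CPU thermal throttle counters before and after an 8-core run. The sketch below assumes the nodes expose the Linux counters at /sys/devices/system/cpu/cpuN/thermal_throttle/core_throttle_count (x86 kernels generally do, but I have not verified this on the RH6.2 kernel). The function name read_throttle_counts is just an illustrative helper, not part of any existing tool.

```python
#!/usr/bin/env python3
"""Print per-CPU thermal throttle counters from Linux sysfs."""
import glob
import os
import re

# Assumed location of the per-CPU sysfs directories on Linux.
SYSFS_CPU = "/sys/devices/system/cpu"


def read_throttle_counts(root=SYSFS_CPU):
    """Return {cpu_number: core_throttle_count} for every CPU exposing it."""
    counts = {}
    pattern = os.path.join(
        root, "cpu[0-9]*", "thermal_throttle", "core_throttle_count"
    )
    for path in glob.glob(pattern):
        # path looks like <root>/cpu3/thermal_throttle/core_throttle_count
        cpu_dir = os.path.basename(os.path.dirname(os.path.dirname(path)))
        match = re.fullmatch(r"cpu(\d+)", cpu_dir)
        if match is None:
            continue
        try:
            with open(path) as fh:
                counts[int(match.group(1))] = int(fh.read().strip())
        except (OSError, ValueError):
            # Counter unreadable or not a number: skip this CPU.
            continue
    return counts


if __name__ == "__main__":
    counts = read_throttle_counts()
    if not counts:
        print("no thermal_throttle counters found "
              "(the kernel or CPU may not expose them)")
    for cpu in sorted(counts):
        print(f"cpu{cpu}: {counts[cpu]} throttle events")
```

Run it on one of the c18 nodes, launch the 8-core benchmark, then run it again. If the counters go up during the slow runs, throttling is a plausible culprit; if they stay flat, it is probably something else.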
Can you give the kernel versions for both setups?

On 24 April 2014 16:56, Joe Landman <[email protected]> wrote:

> On 04/24/2014 11:31 AM, Brian Dobbins wrote:
>
>>
>> Hi everyone,
>>
>> We're having a problem with one of our clusters after it was upgraded
>> to RH6.2 (from CentOS5.5) - the performance of our Infiniband network
>> degrades randomly and severely when using all 8 cores in our nodes for
>> MPI,... but not when using only 7 cores per node.
>>
>> For example, I have a hacked-together script (below) that does a
>> sequence of 20 sets of fifty MPI_Allreduce tests via the Intel MPI
>> benchmarks, and then calculates statistics on the average times per
>> individual set. For our 'good' (CentOS 5.5) nodes, we see consistent
>> results:
>>
>> % perftest hosts_c20_8c.txt
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>   176.0   177.3   182.6   182.8   186.1   196.9
>> % perftest hosts_c20_8c.txt
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>   176.3   180.4   184.8   187.0   189.1   213.5
>>
>> ... But for our tests on the RH6.2 install, we see enormous variance:
>>
>> % perftest hosts_c18_8c.txt
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>   176.8   185.9   217.0   347.6   387.7  1242.0
>> % perftest hosts_c18_8c.txt
>>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
>>   178.2   204.5   390.5   329.6   409.4   493.1
>>
>> Note that the minimums are similar -- not /every/ run experiences
>> this jitter - and in the case of the first run of the script, even the
>> median value is pretty decent, so seemingly only a few of the tests were
>> high. But the maximum is enormous. Each of these tests are run one
>> right after the other, and strangely it seems to always differ between
>> /instances/ of the IMB code, not in individual loops -eg, one of the
>> fifty runs inside an individual call. Those all seem consistent, so
>> that's either luck, or some issue on mapping the IB device, or some
>> interrupt issue in the kernel, etc.
>>
>
> Median changes by more than factor of 2. And the distribution tail is
> *huge*.
>
> FWIW: 6.2 was a terrible release. If you have to use pure RHEL, get to
> 6.5+. And there are many tunables you need to look at.
>
> Bigger view ... have you isolated a CPU for IB handling, so at 7 cores,
> your machine is full (1 for IB and 7 for apps), but at 8 cores you are
> contending for resources (8 for apps + 1 for IB)?
>
> Are you running the app with taskset (explicitly or implicitly)?
>
> --
> Joseph Landman, Ph.D
> Founder and CEO
> Scalable Informatics, Inc.
> email: [email protected]
> web  : http://scalableinformatics.com
> twtr : @scalableinfo
> phone: +1 734 786 8423 x121
> cell : +1 734 612 4615
> _______________________________________________
> Beowulf mailing list, [email protected] sponsored by Penguin Computing
> To change your subscription (digest mode or unsubscribe) visit
> http://www.beowulf.org/mailman/listinfo/beowulf
>
