On 04/24/2014 11:31 AM, Brian Dobbins wrote:
Hi everyone,
We're having a problem with one of our clusters after it was upgraded
to RH6.2 (from CentOS5.5) - the performance of our Infiniband network
degrades randomly and severely when using all 8 cores in our nodes for
MPI,... but not when using only 7 cores per node.
For example, I have a hacked-together script (below) that does a
sequence of 20 sets of fifty MPI_Allreduce tests via the Intel MPI
benchmarks, and then calculates statistics on the average times per
individual set. For our 'good' (CentOS 5.5) nodes, we see consistent
results:
% perftest hosts_c20_8c.txt
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   176.0   177.3   182.6   182.8   186.1   196.9
% perftest hosts_c20_8c.txt
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   176.3   180.4   184.8   187.0   189.1   213.5
... But for our tests on the RH6.2 install, we see enormous variance:
% perftest hosts_c18_8c.txt
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   176.8   185.9   217.0   347.6   387.7  1242.0
% perftest hosts_c18_8c.txt
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   178.2   204.5   390.5   329.6   409.4   493.1
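For readers who want to reproduce this kind of six-number summary, here is a minimal sketch in Python. It assumes the per-set average times have already been collected into a list. The perftest script itself is not shown in this message, so `summarize` is a hypothetical helper and not the original code, and the input values below are placeholders rather than measurements.

```python
import statistics


def summarize(times):
    """Return (min, 1st quartile, median, mean, 3rd quartile, max)."""
    if len(times) < 2:
        raise ValueError("need at least two timings")
    # method="inclusive" interpolates linearly between observed values,
    # so the quartiles never fall outside the range of the data.
    q1, median, q3 = statistics.quantiles(times, n=4, method="inclusive")
    return (min(times), q1, median, statistics.mean(times), q3, max(times))


if __name__ == "__main__":
    # Placeholder timings for illustration only, not cluster measurements.
    example = [1.0, 2.0, 3.0, 4.0, 5.0]
    labels = ("Min.", "1st Qu.", "Median", "Mean", "3rd Qu.", "Max.")
    for label, value in zip(labels, summarize(example)):
        print(f"{label:>8} {value:8.1f}")
```

Comparing the mean against the median in such a summary is a quick way to spot a heavy tail: a few very slow sets pull the mean up much more than the median.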
Note that the minimums are similar -- not /every/ run experiences
this jitter - and in the case of the first run of the script, even the
median value is pretty decent, so seemingly only a few of the tests were
high. But the maximum is enormous. Each of these tests is run one
right after the other, and strangely the timing always seems to differ
between /instances/ of the IMB code, not between individual loops (e.g.,
one of the fifty runs inside an individual call). Those all seem
consistent, so that's either luck, or some issue with mapping the IB
device, or some interrupt issue in the kernel, etc.
The median changes by more than a factor of 2, and the distribution
tail is *huge*.
FWIW: 6.2 was a terrible release. If you have to use pure RHEL, get to
6.5+. And there are many tunables you need to look at.
Bigger view ... have you isolated a CPU for IB handling, so at 7 cores,
your machine is full (1 for IB and 7 for apps), but at 8 cores you are
contending for resources (8 for apps + 1 for IB)?
Are you running the app with taskset (explicitly or implicitly)?
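To make the two questions above concrete, here is a rough sketch in Python. The single core reserved for IB handling is the hypothesis being asked about, not a confirmed property of these nodes, and `is_oversubscribed` and `current_affinity` are hypothetical helper names. `os.sched_getaffinity` is only available on Linux and some other Unix systems, so the sketch guards for it.

```python
import os


def is_oversubscribed(app_ranks, reserved_cores, total_cores):
    """True if MPI ranks plus reserved cores exceed the cores on the node."""
    return app_ranks + reserved_cores > total_cores


def current_affinity():
    """Return the sorted CPU ids this process may run on, or None if the
    platform does not expose scheduler affinity."""
    if hasattr(os, "sched_getaffinity"):
        return sorted(os.sched_getaffinity(0))
    return None


if __name__ == "__main__":
    # Assumption: 8 cores per node, 1 of them set aside for IB handling.
    for ranks in (7, 8):
        print(ranks, "ranks oversubscribed:", is_oversubscribed(ranks, 1, 8))
    print("allowed CPUs for this process:", current_affinity())
```

Running the affinity check from inside a process launched the same way as the MPI ranks would show whether they are pinned (a short CPU list) or free to float across all cores.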
--
Joseph Landman, Ph.D
Founder and CEO
Scalable Informatics, Inc.
email: [email protected]
web : http://scalableinformatics.com
twtr : @scalableinfo
phone: +1 734 786 8423 x121
cell : +1 734 612 4615
_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing