Hi Joe,

> Median changes by more than factor of 2. And the distribution tail is
> *huge*.
>
> FWIW: 6.2 was a terrible release. If you have to use pure RHEL, get to
> 6.5+. And there are many tunables you need to look at.

Thanks for your reply - I may look into asking our IT squad to put 6.5 on a set of nodes for testing, but playing with the tunables is probably the first step. I don't have root access and can't change settings myself, but a few of the power options (e.g., /sys/module/pcie_aspm/parameters/policy) already look like good candidates to switch: that one is currently in a 'power save' state on the poorly performing nodes, whereas it doesn't even exist on the 5.5 nodes.

> Bigger view ... have you isolated a CPU for IB handling, so at 7 cores,
> your machine is full (1 for IB and 7 for apps), but at 8 cores you are
> contending for resources (8 for apps + 1 for IB)?
>
> Are you running the app with taskset (explicitly or implicitly)?

In the test we're running, there isn't really any local processing outside of the communication - each task, bound to its own core, is simply sending messages in a giant loop. While there are clearly 8 cores all talking to 1 IB device, each one (I believe) mmaps its own range and handles its own message processing. Furthermore, this definitely worked before, so it doesn't seem like a resource contention issue unless it's something to do with mmap on the versions we're running. I did double-check that our processes aren't migrating between cores, though.

Mostly, I'm poking around kernel tunables right now and making a list of things that might indicate the issue. I'll also take a deeper look at /proc/interrupts during a run soon.

Thanks again,
- Brian
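
P.S. For reference, here is a minimal Python sketch of the two read-only checks mentioned above, neither of which needs root: reading the active ASPM policy, and diffing /proc/interrupts across a run. It assumes the policy file lists its options on one line with the active one in square brackets (for example "default performance [powersave]"), and that /proc/interrupts has a header row of CPU names followed by "label: count count ..." rows. The function names are made up for illustration, not taken from any existing tool.

```python
from pathlib import Path

# Path taken from the message above; it may not exist on older kernels.
ASPM_POLICY_PATH = Path("/sys/module/pcie_aspm/parameters/policy")


def active_policy(text):
    """Return the bracketed (active) entry of the policy line, or None."""
    for token in text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None


def read_aspm_policy(path=ASPM_POLICY_PATH):
    """Read the active ASPM policy; None if the file is missing or unreadable."""
    try:
        return active_policy(path.read_text())
    except OSError:
        return None


def parse_interrupts(text):
    """Map each interrupt label to its list of per-CPU counts."""
    lines = text.splitlines()
    if not lines:
        return {}
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    table = {}
    for line in lines[1:]:
        if ":" not in line:
            continue
        label, rest = line.split(":", 1)
        counts = []
        for field in rest.split()[:ncpu]:
            if not field.isdigit():
                break  # reached the controller/device description
            counts.append(int(field))
        table[label.strip()] = counts
    return table


def interrupt_delta(before, after):
    """Per-CPU increase for every label present in both snapshots."""
    delta = {}
    for label, new in after.items():
        old = before.get(label)
        if old is not None and len(old) == len(new):
            delta[label] = [n - o for o, n in zip(old, new)]
    return delta
```

Usage would be along the lines of: take parse_interrupts(Path("/proc/interrupts").read_text()) before and after a benchmark run, pass both to interrupt_delta, and see which CPUs absorbed the IB device's interrupts on the good and bad nodes.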
