Hi Jeff,

My apologies for the delay in replying; I was flying back from the UK to the States, but now I'm here and can respond more promptly.
> I confirm that the hwloc message you sent (and your posts to the
> hwloc-users list) indicate that hwloc is getting confused by a buggy
> BIOS, but it's only dealing with the L3 cache, and that shouldn't
> affect the binding that OMPI is doing.

Great, good to know. I'd still be interested in learning how to build a
hwloc-parsable XML file as a workaround, especially if it fixes the
bindings (see below).

> 1. Run with "--report-bindings" and send the output. It'll
> prettyprint-render where OMPI thinks it is binding each process.

Please find it attached.

> 2. Run with "--bind-to none" and see if that helps. I.e., if, per
> #1, OMPI thinks it is binding correctly (i.e., each of the 48
> processes is being bound to a unique core), then perhaps hwloc is
> doing something wrong in the actual binding (i.e., binding the 48
> processes only among the lower 32 cores).

BINGO! As soon as I did this, all the cores went to 100%! Here's the
updated timing (compared to the 13 minutes from before):

real 1m8.442s
user 0m0.077s
sys  0m0.071s

So I guess the conclusion is that hwloc is somehow messing things up on
this chipset?

Thanks,
Andrej
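For anyone following along, the commands involved look roughly like this. This is a sketch only: the binary name (./my_app) and rank count are placeholders, and the XML route is the workaround I was asking about, not something I've verified on this machine yet.

```shell
# Show where Open MPI thinks it is binding each rank (step 1 above)
mpirun --report-bindings -np 48 ./my_app

# Disable OMPI's binding entirely -- this is what fixed the slowdown here
mpirun --bind-to none -np 48 ./my_app

# Possible XML workaround (untested here): dump the topology hwloc
# currently sees, hand-correct the buggy L3 cache entries, and point
# hwloc at the corrected file instead of the BIOS tables.
lstopo --of xml topology.xml
# ...edit topology.xml to fix the L3 description...
# HWLOC_THISSYSTEM=1 tells hwloc the XML really describes this machine,
# so binding calls are still allowed when loading from a file.
HWLOC_XMLFILE=topology.xml HWLOC_THISSYSTEM=1 mpirun -np 48 ./my_app
```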
test_report_bindings.stderr