Hi Jeff,

My apologies for the delay in replying; I was flying back from the UK
to the States, but I'm here now and can respond more promptly.

> I confirm that the hwloc message you sent (and your posts to the
> hwloc-users list) indicate that hwloc is getting confused by a buggy
> BIOS, but it's only dealing with the L3 cache, and that shouldn't
> affect the binding that OMPI is doing.

Great, good to know. I'd still be interested in learning how to build
an hwloc-parsable XML file as a workaround, especially if it fixes the
bindings (see below).
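(For anyone else following this thread, my understanding is that the
workaround goes roughly like this; the filename is just an example:)

```shell
# Export the topology hwloc detected on this machine to XML:
lstopo --of xml topology.xml

# Hand-edit topology.xml to correct the bogus L3 cache information the
# BIOS reports, then tell hwloc to load the fixed topology instead of
# re-discovering it:
export HWLOC_XMLFILE=topology.xml
export HWLOC_THISSYSTEM=1   # treat the XML as describing this very machine
```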

> 1. Run with "--report-bindings" and send the output.  It'll
> prettyprint-render where OMPI thinks it is binding each process.

Please find it attached.
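For reference, the run was along these lines (application name and rank
count are placeholders for my actual job):

```shell
# Ask Open MPI to print where it binds each of the 48 processes:
mpirun --report-bindings -np 48 ./my_app
```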

> 2. Run with "--bind-to none" and see if that helps.  I.e., if, per
> #1, OMPI thinks it is binding correctly (i.e., each of the 48
> processes is being bound to a unique core), then perhaps hwloc is
> doing something wrong in the actual binding (i.e., binding the 48
> processes only among the lower 32 cores).

BINGO! As soon as I did this, indeed all the cores went to 100%! Here's
the updated timing (compared to 13 minutes from before):

        real    1m8.442s
        user    0m0.077s
        sys     0m0.071s
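(That is, with binding disabled via something along these lines; the
application name is again illustrative:)

```shell
# Disable Open MPI's process binding entirely and time the run:
time mpirun --bind-to none -np 48 ./my_app
```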

So I guess the conclusion is that hwloc is somehow messing things up on
this chipset?

Thanks,
Andrej

Attachment: test_report_bindings.stderr
