Re: [Beowulf] Infiniband PortXmitWait problems on IBM Sandybridge iDataplex with Mellanox ConnectX-3

Peter Kjellström Wed, 12 Jun 2013 07:18:46 -0700

On Wednesday 12 June 2013 15:03:11 Christopher Samuel wrote:
> Hi folks,


In my experience, bad cables or an unstable fabric (on FDR + CX3 clusters)
show up as one or more of:

 * LinkDownedCounter ticking up
 * SymbolErrorCounter ticking up
 * ports running at FDR10 instead of FDR (check with, for example,
   iblinkinfo or ibdiagnet)

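PortXmitWait is expected to increase (a lot) under load: it counts the ticks
during which a port had data to send but could not (for example, while
waiting for credits), so on its own it indicates congestion, not a faulty
link.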
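
As a rough illustration of that triage, here is a small Python sketch that
splits ibqueryerrors-style output into counters that point at link health and
PortXmitWait entries. The line format is taken from the example quoted further
down in this mail; the second sample line, the function name triage() and the
grouping itself are assumptions for illustration, not a feature of the tool.

```python
import re

# Counters that, per the list above, point at bad cables or an unstable
# fabric. PortXmitWait is deliberately not in this set.
LINK_HEALTH = {"LinkDownedCounter", "SymbolErrorCounter"}

# Matches lines shaped like the ibqueryerrors example quoted in this thread:
#   GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
LINE_RE = re.compile(r"GUID\s+(0x[0-9a-fA-F]+)\s+port\s+(\d+):\s*(.*)")
COUNTER_RE = re.compile(r"\[(\w+)\s*==\s*(\d+)\]")


def triage(text):
    """Return (suspect, congested) lists of (guid, port, counter, value).

    Counters other than the link-health ones and PortXmitWait are ignored.
    """
    suspect, congested = [], []
    for line in text.splitlines():
        m = LINE_RE.search(line)
        if not m:
            continue
        guid, port, rest = m.group(1), int(m.group(2)), m.group(3)
        for name, value in COUNTER_RE.findall(rest):
            entry = (guid, port, name, int(value))
            if name in LINK_HEALTH:
                suspect.append(entry)
            elif name == "PortXmitWait":
                congested.append(entry)
    return suspect, congested


# First line is from the thread; the second is a made-up example.
SAMPLE = """\
   GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
   GUID 0x2c90300771450 port 7: [SymbolErrorCounter == 12] [PortXmitWait == 4031]
"""

if __name__ == "__main__":
    suspect, congested = triage(SAMPLE)
    for guid, port, name, value in suspect:
        print(f"check cable/port: {guid} port {port} {name}={value}")
    print(f"{len(congested)} port(s) reporting PortXmitWait (congestion only)")
```

Only the entries in the first list would be worth chasing as cabling or port
problems; the second list is just a measure of how busy the fabric was.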

Some ideas:

 * Use something simple but large scale to test IB before HPL.
 * Run HPL with show-as-you-go progress output and plot that data (even but
   low == bad HPL config or a low-perf node / BLAS; uneven plot == bad
   fabric).
 * Run HPL over Ethernet; you should be able to get >50% efficiency on 65
   nodes unless your Ethernet is too weak.
 * Make sure ranks are placed properly and consume the expected amount of
   RAM; also check pinning if used (top, lstopo --ps, taskset -p PID, ..).
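
The "even but low" versus "uneven" reading of an HPL progress plot can be
sketched in Python roughly as follows. This assumes the per-step Gflops values
have already been extracted into a list; the function name
classify_progress() and both thresholds are hypothetical illustration values,
not tuned numbers.

```python
from statistics import mean, pstdev


def classify_progress(gflops, peak_gflops, low_eff=0.5, max_cv=0.15):
    """Classify a series of HPL progress samples (Gflops per step).

    low_eff and max_cv are arbitrary illustration thresholds.
    """
    if not gflops:
        raise ValueError("need at least one sample")
    avg = mean(gflops)
    # Coefficient of variation: spread of the samples relative to their mean.
    cv = pstdev(gflops) / avg if avg > 0 else float("inf")
    if cv > max_cv:
        return "uneven: suspect the fabric"
    if avg / peak_gflops < low_eff:
        return "even but low: suspect HPL config or a slow node / BLAS"
    return "even and reasonable"


if __name__ == "__main__":
    # Made-up sample series against a made-up peak of 1000 Gflops.
    print(classify_progress([250, 255, 248, 252], 1000))
    print(classify_progress([800, 300, 750, 200], 1000))
```

Looking at the actual plot is still the better test; the sketch only shows
which two properties of the curve are being compared.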

With your config you should be able to get to ~90% HPL efficiency.
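
For reference, HPL efficiency is the measured result divided by theoretical
peak. A minimal Python sketch of that arithmetic follows; the node layout (2
sockets, 8 cores per socket, 2.6 GHz) is a placeholder assumption since the
thread does not give it, while 8 double-precision flops per cycle per core is
the usual figure for Sandy Bridge with AVX.

```python
def rpeak_gflops(nodes, sockets, cores_per_socket, ghz, flops_per_cycle=8):
    """Theoretical peak in Gflops; 8 flops/cycle assumes Sandy Bridge AVX."""
    return nodes * sockets * cores_per_socket * ghz * flops_per_cycle


def hpl_efficiency(rmax_gflops, peak_gflops):
    """Fraction of theoretical peak achieved by a measured HPL result."""
    return rmax_gflops / peak_gflops


if __name__ == "__main__":
    # Placeholder node layout: 65 nodes, 2 sockets, 8 cores, 2.6 GHz.
    peak = rpeak_gflops(65, 2, 8, 2.6)
    for target in (0.25, 0.90):
        print(f"{target:.0%} of peak would be {target * peak:.0f} Gflops")
```

With those placeholder numbers, the gap between ~25% and ~90% is roughly
5400 versus 19500 Gflops on the 65 nodes.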

Cheers,
 Peter

> I'm doing the bring up and testing on our SandyBridge IBM iDataplex
> with an FDR switch and as part of that I've been doing burn-in testing
> with HPL and seeing really poor efficiency (~25% over 65 odd nodes
> with 256GB RAM).  Simultaneously HPL on the 3 nodes with 512GB RAM
> gives ~70% efficiency.
> 
> Checking the switch with ibqueryerrors shows lots of things like:
> 
>    GUID 0x2c90300771450 port 22: [PortXmitWait == 198817026]
> 
> That's about 2 or 3 hours after last clearing the counters. :-(
> 
> Doing:
> 
> # ibclearcounters && ibclearerrors && sleep 1 && ibqueryerrors
> 
> Shows 75 of 94 nodes bad, pretty much all with thousands of
> PortXmitWait, some into the tens of thousands.
> 
> We are running RHEL 6.3, Mellanox OFED 2.0.5, FDR IB and Open-MPI 1.6.4.
> 
> Talking with another site who also has the same sort of iDataplex, but
> running RHEL 5.8, Mellanox OFED 1.5 and QDR IB, reveals that they (once
> they started looking) are also seeing high PortXmitWait counters
> shortly after clearing them with user codes.
> 
> These are Mellanox MT27500 ConnectX-3 adapters.
> 
> We're talking with both IBM and Mellanox directly, but other than
> Mellanox spotting some GPFS NSD file servers that had bad FDR ports
> (which got unplugged last week and fixed today) we've not made any
> progress into the underlying cause. :-(
> 
> Has anyone seen anything like this before?
> 
> cheers!
> Chris

