Can you tell how IPoIB is configured (connected mode or datagram mode)? Also, can you send more context from /var/log/messages? In particular, can you rerun with debug enabled and send the output? Debug can be enabled with:

    echo 1 > /sys/module/ib_ipoib/parameters/debug_level
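As a minimal sketch of how to gather both pieces of information (assuming the IPoIB interface is named ib0, which may differ on your nodes):

    # Show whether the IPoIB interface runs in "connected" or "datagram" mode
    cat /sys/class/net/ib0/mode

    # Enable IPoIB debug messages, then watch what shows up in the log
    echo 1 > /sys/module/ib_ipoib/parameters/debug_level
    tail -f /var/log/messages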
On Fri, 2007-11-16 at 17:23 -0600, Chris Elmquist wrote:
> On Thursday (11/15/2007 at 12:23PM -0800), [EMAIL PROTECTED] wrote:
> >
> > We have a large (~1800 node) IB cluster of x86_64 machines, and
> > we're having some significant problems with IPoIB.
> >
> > The thing that all the IPoIB failures have in common seems to be
> > an appearance of a "CQ overrun" in syslog, e.g.:
> >
> >     ib_mthca 0000:06:00.0: CQ overrun on CQN 180082
> >
> > From there things go badly in different ways - tx_timeouts,
> > oopses, etc. Sometimes things just start working again after
> > a few minutes.
> >
> > The appearance of these failures seems to be well correlated
> > with the size of the machine. I don't think there are any problems
> > until the machine is built up to about its maximum size, and
> > then they become pretty common.
> >
> > We are using MT25204 HCAs with 1.2.0 firmware, and OFED 1.2.
> >
> > Does this ring a bell with anyone?
>
> I can perhaps elaborate a little more on the test case we are using to
> expose this situation...
>
> On 1024 (or more) nodes, nttcp -i is started as a "tcp socket server".
> Eight copies are started, each on a different tcp port (5000 ... 5007).
>
> On another client node, as few as 1024 and as many as 8192 nttcp clients
> are launched from that node to all of the 1024 others. We can have
> one connection between the client and each node, or we can have eight
> connections between the client and each node. The nttcp test is run
> for 120 secs, and in these scenarios all connections get established,
> nttcp moves data, and never fails. We get expected performance.
>
> If the node count is increased to 1152, then things start to become
> unreliable. We will see connections fail to be established when we try
> to do 8 per node. If we do one per node, they will all establish and run.
> In fact, we can do one per node across 1664 nodes and that will succeed also.
>
> So the problem seems to be related to the total number of nodes on
> the fabric as well as how many TCP connections you try to establish to
> each node.
>
> One is tempted to believe it is a problem at the single node that is
> opening all of these connections to the others... but the failure occurs
> on the nodes being connected to -- the nttcp servers -- with the CQ overrun
> and TX WATCHDOG TIMEOUTS, etc. The final outcome is that we lose all TCP
> connectivity over IB to the affected nodes for some period of time.
> Sometimes they come back, sometimes they don't, and sometimes it's seconds
> and sometimes it's minutes before they come back. Not very deterministic.
>
> cje
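For reference, the launch pattern described above might be sketched roughly as follows. This is only a hedged sketch: the node list file, the ssh fan-out, and the nttcp port option (-p) are assumptions; only "nttcp -i" as the server invocation and the port range 5000-5007 come from the report itself.

    # Start eight "tcp socket servers" per node (assumed -p port option)
    while read node; do
        for port in $(seq 5000 5007); do
            ssh "$node" "nohup nttcp -i -p $port >/dev/null 2>&1 &"
        done
    done < nodes.txt

    # From a single client node, open one connection per server node
    # (assumed client invocation; the test runs for roughly 120 seconds)
    while read node; do
        nttcp -p 5000 "$node" &
    done < nodes.txt
    wait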
