We have a large (~1800 node) IB cluster of x86_64 machines, and we're having some significant problems with IPoIB.
The thing that all the IPoIB failures have in common seems to be the appearance of a "CQ overrun" in syslog, e.g.:

    ib_mthca 0000:06:00.0: CQ overrun on CQN 180082

From there things go badly in different ways: tx_timeouts, oopses, etc. Sometimes things just start working again after a few minutes.

The appearance of these failures seems to be well correlated with the size of the machine. I don't think there are any problems until the machine is built up to about its maximum size, and then they become pretty common.

We are using MT25204 HCAs with 1.2.0 firmware, and OFED 1.2.

Does this ring a bell with anyone?

--
Arthur
