Isaac Huang wrote:
> On Mon, Aug 17, 2009 at 12:23:35PM -0400, Charles A. Taylor wrote:
>> FWIW, I posted this to ofa-general a little earlier. Anyone else
>> seeing this? Suggestions? I think this is an OFED 1.4.1 problem
>> but they may point the finger at you guys. :)
>>
>> We've tried limiting OST threads to no avail. It doesn't really seem
>> to require a heavy load to trigger it - more or less random.
>
> I wouldn't think it's directly caused by Lustre. The IPoIB interface
> is only needed for address resolution - no Lustre traffic would end up
> sitting in the IPoIB interface's TX queue.
We are using a tcp NID on the (troubled) ib1 interfaces to reach our
non-IB hosts. We have o2ib NIDs on ib0 (dual-port HCA) to reach the
InfiniBand-connected hosts on the same subnet. No problems there.

> Have you tried to stress IPoIB, without Lustre running, with a TCP/IP
> benchmark (e.g. Netperf, Iperf, NetPIPE) or simply a 'ping -f'?

We've tried to stress IPoIB with netperf TCP_STREAM on a spare OSS
(same hardware, same connectivity) running the same Lustre kernel.
No trouble so far.

Cheers,
Craig Prescott
UF HPC Center

> Isaac
>
>> ......
>> Aug 17 09:46:59 hpcio8 kernel: NETDEV WATCHDOG: ib1: transmit timed out
>> Aug 17 09:46:59 hpcio8 kernel: ib1: transmit timeout: latency 347449 msecs
>> Aug 17 09:46:59 hpcio8 kernel: ib1: queue stopped 1, tx_head 868165770, tx_tail 868165647
>>
>> The difference between the head/tail is always 123. The send queue
>> size is 128 according to...
>>
>> cat /sys/module/ib_ipoib/parameters/send_queue_size
>> 128

_______________________________________________
Lustre-discuss mailing list
[email protected]
http://lists.lustre.org/mailman/listinfo/lustre-discuss
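In case it helps anyone searching the archives, the stress test described above looks roughly like the following. The 10.1.1.8 address is a placeholder for the OSS's ib1 IPoIB address (not a real address from this thread), and the netperf/ping commands need root and a live IB fabric, so they are shown commented out here:

```shell
# Placeholder peer address for the OSS's ib1 IPoIB interface:
#   OSS_IB1=10.1.1.8

# Flood-ping the IPoIB interface from a client (requires root):
#   ping -f -c 100000 "$OSS_IB1"

# Saturate the TCP path over ib1 for 60 s (netserver must be running
# on the OSS):
#   netperf -H "$OSS_IB1" -t TCP_STREAM -l 60

# Confirm the IPoIB send queue depth on the OSS:
#   cat /sys/module/ib_ipoib/parameters/send_queue_size   # reports 128

# The watchdog messages quoted below show the TX ring stuck with
# tx_head - tx_tail = 123 entries in flight, just short of the
# 128-entry send queue:
echo $((868165770 - 868165647))   # prints 123
```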
