On Sun, Nov 18, 2007 at 10:27:23AM +0200, Eli Cohen wrote:
> Can you tell how IPOIB is configured - connected mode or datagram mode?
> Also can you send more context from /var/log/messages? Especially can
> you rerun with debug enabled and send the output?
> Enabling debug can be done by:
> echo 1 > /sys/module/ib_ipoib/parameters/debug_level
Yes, it's connected mode. Here is another log of an overrun, taken with debug_level=1. I added code to dump the CQ context table (it just issues a QUERY_CQ and logs the result).

15:50:39 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:50:42 r3i1n2 kernel: ib0: neigh_destructor for 000404 fe80:0000:0000:0000:0008:f104:0398:8595
15:50:42 r3i1n2 kernel: ib0: Reap connection for gid fe80:0000:0000:0000:0008:f104:0398:8595
15:50:42 r3i1n2 kernel: ib0: Destroy active connection 0xf048d head 0x2 tail 0x2
15:50:53 r3i1n2 in.rshd[7056]: connect from 10.148.0.9 (10.148.0.9)
15:50:53 r3i1n2 kernel: ib_mthca 0000:06:00.0: CQ overrun on CQN 240082
15:50:53 r3i1n2 kernel: cq_context = 0xffff8101eee9c000
15:50:53 r3i1n2 kernel: flags = 0x90000900
15:50:53 r3i1n2 kernel: start_hi = 0x0
15:50:53 r3i1n2 kernel: start_lo = 0x0
15:50:53 r3i1n2 kernel: logsize_usrpage = 0x7000002
15:50:53 r3i1n2 kernel: comp_eqn = 0x1
15:50:53 r3i1n2 kernel: pd = 0x4
15:50:53 r3i1n2 kernel: lkey = 0xd0108900
15:50:53 r3i1n2 kernel: last_notified_index = 0x217
15:50:53 r3i1n2 kernel: solicit_producer_index = 0x9c18
15:50:53 r3i1n2 kernel: consumer_index = 0x0
15:50:53 r3i1n2 kernel: producer_index = 0x218
15:50:53 r3i1n2 kernel: cqn = 0x240082
15:50:53 r3i1n2 kernel: ci_db = 0x7ffd
15:50:53 r3i1n2 kernel: state_db = 0x1
15:50:58 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:50:58 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:51:11 r3i1n2 in.rshd[7057]: connect from 10.148.0.9 (10.148.0.9)
15:51:27 r3i1n2 kernel: ib0: REQ arrived
15:51:31 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:51:32 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:51:32 r3i1n2 kernel: ib0: Send unicast ARP to 00ac
15:51:42 r3i1n2 in.rshd[7058]: connect from 10.148.0.9 (10.148.0.9)
15:52:12 r3i1n2 in.rlogind[7059]: connect from 10.148.0.9 (10.148.0.9)
15:52:17 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:52:22 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:52:54 r3i1n2 in.rlogind[7060]: connect from 192.168.159.1 (192.168.159.1)
15:52:54 r3i1n2 rlogind[7060]: pam_rhosts_auth(rlogin:auth): allowed to [EMAIL PROTECTED] as root
15:52:59 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:53:11 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:53:32 r3i1n2 kernel: ib0: Send unicast ARP to 00ac
15:54:14 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:54:19 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:54:26 r3i1n2 kernel: ib_mthca 0000:06:00.0: mthca_create_cq: cq = 0xffff81015a3ee7c0 cqn = 0x350090
15:54:26 r3i1n2 kernel: ib0: ipoib_cm_tx_init: ib_create_cq returns 0xffff81022523b1c0
15:54:26 r3i1n2 kernel: ib0: Request connection 0x13048f for gid fe80:0000:0000:0000:0008:f104:0398:8595 qpn 0x404
15:54:26 r3i1n2 kernel: ib0: REP received.
15:54:43 r3i1n2 in.rshd[7061]: connect from 192.168.159.1 (192.168.159.1)
15:54:43 r3i1n2 rshd[7061]: pam_rhosts_auth(rsh:auth): allowed to [EMAIL PROTECTED] as root
15:54:48 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:55:03 r3i1n2 login[4750]: resmgr: unable to connect to resmgrd: No such file or directory
15:55:03 r3i1n2 login[4750]: resmgr login failed
15:55:23 r3i1n2 kernel: ib0: Send unicast ARP to 0165
15:55:28 r3i1n2 kernel: ib1: Send unicast ARP to 016d
15:55:30 r3i1n2 kernel: ib0: TX ring 0xf00405 full, stopping kernel net queue
15:55:32 r3i1n2 kernel: NETDEV WATCHDOG: ib0: transmit timed out
15:55:32 r3i1n2 kernel: ib0: transmit timeout: latency 1688 msecs
15:55:32 r3i1n2 kernel: ib0: queue stopped 1, tx_head 13657, tx_tail 13657
15:55:33 r3i1n2 kernel: NETDEV WATCHDOG: ib0: transmit timed out

Looking at the contents of the CQ context table (dumped right after the overrun at 15:50:53), do the producer and consumer indices look reasonable? I expected to find that producer_index + 1 == consumer_index.

--
Arthur
