Hello, I have a customer (IHAC) who is experiencing a problem with IB. Specifically, when the InfiniHost III card is placed in connected mode with 'echo connected > /sys/class/net/ib0/mode', some nodes stop responding. By 'stop responding' I mean:
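For reference, this is roughly how the mode can be inspected before and after the switch (show_mode is an illustrative helper, not an OFED tool; the sysfs path is the standard per-interface one, and MODE_FILE can be overridden on machines without an IB port):

```shell
#!/bin/sh
# Sketch: report the current IPoIB mode of ib0 (datagram or connected).
# MODE_FILE defaults to the standard sysfs path; override it (e.g. point
# it at a saved copy) when testing on a machine with no IB interface.
MODE_FILE=${MODE_FILE:-/sys/class/net/ib0/mode}

show_mode() {
    if [ -r "$MODE_FILE" ]; then
        printf 'ib0 mode: %s\n' "$(cat "$MODE_FILE")"
    else
        printf 'ib0 mode: unavailable (%s not readable)\n' "$MODE_FILE"
    fi
}

# To switch (as root):  echo connected > "$MODE_FILE"
show_mode
```

Running it before and after the echo on each node confirms whether the mode change actually took effect there.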
- ping <ib ip address> doesn't work (no packets returned; 100% packet loss)
- ib_rdma_bw -b <node> never runs
- ibping does work

Since the customer is mounting their NFS server over IB, NFS service stops working when in connected mode. What is interesting is that if I leave the NFS server in datagram mode, the affected nodes can still interact with it, i.e., NFS service continues to work, but I cannot communicate over IB with other nodes that are also in connected mode.

At first I thought this was only a problem with IPoIB. I note the following difference between nodes that do not work in connected mode and nodes that do. The first output is from a node that stops working, the second from a node that continues to work.

[r...@ws3 ~]# modinfo ib_ipoib
filename:     /lib/modules/2.6.18-128.el5/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
license:      Dual BSD/GPL
description:  IP-over-InfiniBand net driver
author:       Roland Dreier
srcversion:   E3C28A100A995101E2AB934
depends:      ib_cm,ipv6,ib_core,ib_sa
vermagic:     2.6.18-128.el5 SMP mod_unload gcc-4.1
parm:         max_nonsrq_conn_qp:Max number of connected-mode QPs per interface (applied only if shared receive queue is not available) (int)
parm:         set_nonsrq:set to dictate working in none SRQ mode, otherwise act according to device capabilities (int)
parm:         mcast_debug_level:Enable multicast debug tracing if > 0 (int)
parm:         send_queue_size:Number of descriptors in send queue (int)
parm:         recv_queue_size:Number of descriptors in receive queue (int)
parm:         debug_level:Enable debug tracing if > 0 (int)
module_sig:   883f35049492f615cdc734e64d24fa112659309d1b9619270a5e84a97a46cbc6e4ac0908b21f20a0a75b803bc72eba1ce62d2a8eec53fd9c2d7288c
[r...@ws3 ~]#

[r...@scyld ~]# modinfo ib_ipoib
filename:     /lib/modules/2.6.18-128.1.1.el5.530g0000/kernel/drivers/infiniband/ulp/ipoib/ib_ipoib.ko
license:      Dual BSD/GPL
description:  IP-over-InfiniBand net driver
author:       Roland Dreier
srcversion:   8E47481E21B330BFE32B7CE
depends:      ib_cm,ipv6,ib_core,ib_sa
vermagic:     2.6.18-128.1.1.el5.530g0000 SMP mod_unload gcc-4.1
parm:         max_nonsrq_conn_qp:Max number of connected-mode QPs per interface (applied only if shared receive queue is not available) (int)
parm:         set_nonsrq:set to dictate working in none SRQ mode, otherwise act according to device capabilities (int)
parm:         mcast_debug_level:Enable multicast debug tracing if > 0 (int)
parm:         send_queue_size:Number of descriptors in send queue (int)
parm:         recv_queue_size:Number of descriptors in receive queue (int)
parm:         debug_level:Enable debug tracing if > 0 (int)
module_sig:   883f35049c0555e56ccec1c0ba19c3112535c09b5f5dbc8607465f947d60f2be7fa26132d43309f5dc241bebfe2f2f88fc7c93fbe5ea12cd721a59
[r...@scyld ~]#

However, after retesting with ib_rdma_bw I can see that even the verbs layer is not working. I have not tried using the ib_ipoib.ko from the 'working' configuration on a non-working system, since I assumed it would not load due to the slight kernel difference.

It should be noted that I have four nodes that fail and nearly 20 that 'work'. The failing nodes are all running the same kernel (2.6.18-128.el5), while the working nodes are running the 2.6.18-128.1.1.el5 kernel.

I am at a loss as to how to proceed with debugging this, short of getting the latest OFED distro and building it. Has anyone else run into this problem and, if so, how did you get around it?

TIA
R.
_______________________________________________
general mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general
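[Editor's triage sketch, appended for readers hitting the same split: the post reports that failing and working nodes divide cleanly along kernel version, so a first pass over a cluster can simply flag nodes on the failing kernel. classify_node is a hypothetical helper; the version strings are the ones quoted in the post.]

```shell
#!/bin/sh
# Sketch: flag nodes running the kernel this post reports as failing
# in IPoIB connected mode (2.6.18-128.el5); nodes on the reported
# working kernel (2.6.18-128.1.1.el5.530g0000) print "ok".
FAILING_KERNEL="2.6.18-128.el5"

classify_node() {
    # $1 = a node's `uname -r` string
    case "$1" in
        "$FAILING_KERNEL") echo "suspect" ;;
        *)                 echo "ok" ;;
    esac
}

# Classify the local node; for a cluster, feed each node's uname -r in.
classify_node "$(uname -r)"
```

This only confirms the correlation the poster already observed; it does not explain it, but it narrows which nodes are worth comparing module-by-module.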
