On Wed, Jul 15, 2009 at 10:47:11AM +0300, Or Gerlitz wrote:
> Isaac Huang wrote:
> > [...] bonding device over ib0 and ib2 worked well ib2 as an
> > independent IPoIB device couldn't work (ICMP pings failed). It was
> > CentOS 5.3, with ib-bonding-0.9.0-28.
>
> Generally speaking, assigning an IP address and hence a route entry
> to a slave is not recommended and

My understanding, which I'd be happy to find false, was that an RDMA
cm_id couldn't be created and bound to a bonding device. If true, then
assigning IPs to the slaves seemed to be the only way to get ULPs that
rely on the RDMA CM API to work, while the master interface provides
failover to TCP/IP applications.
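For concreteness, the pattern I have in mind looks roughly like the
sketch below - trimmed and untested, not our actual code; the
addresses are the ones from the outputs further down. Build with
-lrdmacm.

    #include <stdio.h>
    #include <string.h>
    #include <arpa/inet.h>
    #include <rdma/rdma_cma.h>

    int main(void)
    {
            struct rdma_event_channel *ch;
            struct rdma_cm_id *id;
            struct rdma_cm_event *ev;
            struct sockaddr_in src, dst;

            ch = rdma_create_event_channel();
            if (ch == NULL || rdma_create_id(ch, &id, NULL, RDMA_PS_TCP)) {
                    perror("rdma_create_id");
                    return 1;
            }

            memset(&src, 0, sizeof(src));
            src.sin_family = AF_INET;
            inet_pton(AF_INET, "10.1.13.49", &src.sin_addr); /* ib2, the slave */

            memset(&dst, 0, sizeof(dst));
            dst.sin_family = AF_INET;
            inet_pton(AF_INET, "10.1.1.132", &dst.sin_addr); /* peer on ib2's subnet */

            /* Pin the cm_id to the slave by binding to its address;
             * binding to bond0's address is what, to my understanding,
             * cannot work. */
            if (rdma_bind_addr(id, (struct sockaddr *)&src)) {
                    perror("rdma_bind_addr");
                    return 1;
            }

            if (rdma_resolve_addr(id, (struct sockaddr *)&src,
                                  (struct sockaddr *)&dst, 2000)) {
                    perror("rdma_resolve_addr");
                    return 1;
            }

            if (rdma_get_cm_event(ch, &ev) == 0) {
                    if (ev->event == RDMA_CM_EVENT_ADDR_ERROR)
                            /* what we actually get, status -ETIMEDOUT */
                            fprintf(stderr, "ADDR_ERROR, status %d\n",
                                    ev->status);
                    rdma_ack_cm_event(ev);
            }

            rdma_destroy_id(id);
            rdma_destroy_event_channel(ch);
            return 0;
    }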
> doesn't come without pain, e.g. see "Potential Sources of Trouble",
> section 8.1 "Adventures in Routing" of
> Documentation/networking/bonding.txt, so your problem might have
> nothing to do with IPoIB. What kernel does CentOS 5.3 come with? You
> may be able to use the mainline bonding driver.

Thanks for the pointer; our configuration looked good, and the slave
did not have routes that superseded routes of the master:

    # ip route show
    10.0.0.0/16 dev bond0  proto kernel  scope link  src 10.0.13.49
    10.1.0.0/16 dev ib2  proto kernel  scope link  src 10.1.13.49

It appeared that all ARP requests over the slave ib2 failed, which was
why ICMP pings failed:

    # ip neigh show
    10.0.1.111 dev bond0 lladdr 80:00:00:48:fe:80:00:00:00:00:00:10:00:03:ba:00:01:00:fc:05 REACHABLE
    10.0.1.101 dev bond0 lladdr 80:00:00:48:fe:80:00:00:00:00:00:10:00:03:ba:00:01:00:fb:05 REACHABLE
    10.1.1.112 dev ib2  FAILED
    10.1.1.132 dev ib2  FAILED
    10.1.1.131 dev ib2  FAILED

rdma_resolve_addr() on a cm_id bound to the slave also failed with an
RDMA_CM_EVENT_ADDR_ERROR event, status -ETIMEDOUT. But tcpdump on ib2
did show both the ARP request and the response:

    15:20:47.571428 arp who-has 10.1.1.132 tell 10.1.13.49 hardware #32
    15:20:47.571631 arp reply 10.1.1.132 is-at 80:00:00:49:fe:80:00:00:00:00:00:10:00:03:ba:00:01:00:fb:8a hardware #32

The response seems to have been dropped by the ARP code for some
reason. The ARP code appears to match responses with outstanding
requests on a per-interface basis, and to drop responses that have no
matching request on their incoming interface. When a response arrives
on a slave, is it considered to have been received on the slave
interface or on its master? That seemed to me to be the only place
where the responses could be dropped - everything worked fine when
bonding was not enabled.
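From my (quite possibly mistaken) reading of net/ipv4/arp.c, the
reply path boils down to something like the following paraphrased
fragment, not the verbatim source:

    /* arp_process(), heavily trimmed: the neighbour lookup is keyed
     * on the device the skb claims to have arrived on. */
    struct net_device *dev = skb->dev;
    /* ... validate the packet, extract the sender IP into sip ... */
    n = __neigh_lookup(&arp_tbl, &sip, dev, 0);
    if (n == NULL)
            goto out; /* no pending entry on 'dev': reply is ignored */

So if bonding rewrites skb->dev from ib2 to bond0 before ARP
processing - which I believe the 2.6.18-era receive path does for
frames arriving on slaves, though I haven't verified it on the CentOS
kernel - the lookup would miss the INCOMPLETE entry created against
ib2 when the request went out, and the reply would be dropped exactly
as observed.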
Thanks,
Isaac