Or Gerlitz wrote:
> Can be very nice if you share with the community the IB stack issues
> revealed under scale-out testing... basically what was the testbed?
We have a 256-node (512-processor) cluster that we can test with on the second Tuesday following the first Monday of any month with two full moons. We're only now getting some time on the cluster, and our test capabilities are limited. The main issue that we saw was that the SA simply doesn't scale.

> From what the patch does I understand you attempt to handle timeout on
> address and route resolution and long disconnect delay.

Correct.

> Was the issue with address resolution being ARP request or reply
> messages getting lost?

This appears to be the case. During test startup, we try to form all-to-all connections. As we scaled, the number of address resolutions that timed out also increased. We suspect that this is a result of the IPoIB broadcast channel getting hit with 100,000+ requests. (A sketch of retrying resolution on timeout is appended at the end of this mail.)

> Was the issue with route resolution being timeout on SA Path queries?

Yes - but the issues are more complex than that. The SA was able to respond to 4,000-6,000 queries per second. With an all-to-all connection model, it receives about 130,000 requests. Even assuming that none of these are lost, with a 4 second timeout it can respond to only a fraction of the original requests in time; the remaining 100,000+ requests have already timed out before it can send a response. At 5,000 queries per second, it will take the SA nearly 30 seconds to respond to the first set of requests, most of which will have timed out. By the time it reached the end of the first 130,000 requests, it had hundreds of thousands of queued retries, most of which had also already timed out. (E.g. even with an exponential backoff, you'd have retries at 4 seconds, 12 seconds, and 28 seconds before the SA can finish processing the first set of requests; a rough model of this arithmetic follows below.)

To further complicate the issue, retried requests are given new transaction IDs by the ib_sa module, which makes it impossible for the SA to distinguish retries from original requests. It sees every request as new. On our largest run, we were never able to complete route resolution. We're still exploring possibilities in this area.

> Was the issue with disconnect delay that peer A called
> dat_ep_disconnect() (ie sending DREQ) and the DREP was sent only when
> peer B got the disconnect event and called dat_ep_disconnect()? so now
> the DREP is sent from within the provider code when it gets the DREQ?

The disconnect delay occurred because of remote nodes being slow to respond to disconnect requests. We're still investigating this issue.

- Sean
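
For concreteness, here is a crude, back-of-the-envelope model of the SA backlog described above. It is only a sketch under the numbers quoted in this mail (130,000 queries, ~5,000 responses per second, 4 second client timeout, retries at 4/12/28 seconds); none of these values are measured by the program itself.

/* sa_backlog.c - rough model of the SA query backlog */
#include <stdio.h>

int main(void)
{
	const double total_queries = 130000.0;   /* all-to-all path queries */
	const double sa_rate       = 5000.0;     /* SA responses per second */
	const double timeout       = 4.0;        /* client timeout, seconds */
	const double retry_times[] = { 4.0, 12.0, 28.0 };
	const unsigned nretries = sizeof(retry_times) / sizeof(retry_times[0]);

	/* Time for the SA to drain the initial burst. */
	double drain = total_queries / sa_rate;

	/* Queries answered before the client gives up. */
	double answered_in_time = sa_rate * timeout;

	/* Each retry wave resubmits roughly everything that was not
	 * answered in time; waves that arrive while the SA is still
	 * working through the original burst only deepen the queue,
	 * because the SA cannot tell a retry from a new request. */
	double backlog = total_queries;
	for (unsigned i = 0; i < nretries; i++)
		if (retry_times[i] < drain)
			backlog += total_queries - answered_in_time;

	printf("drain time for initial burst:  %.0f s\n", drain);
	printf("answered within the timeout:   %.0f of %.0f\n",
	       answered_in_time, total_queries);
	printf("queued requests (worst case):  %.0f\n", backlog);
	return 0;
}

With these inputs the model gives a 26 second drain time, roughly 20,000 queries answered within the timeout, and a queue in the hundreds of thousands once the first retry waves land, which matches the "hundreds of thousands of queued retries" described above.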

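The patch under discussion is against the uDAPL provider and is not reproduced here. Purely to illustrate the retry-on-timeout idea at the RDMA CM level, below is a minimal librdmacm sketch; the 4000 ms timeouts, the retry count, and the helper name resolve_with_retry are illustrative assumptions, not part of the patch. The sketch also assumes the next event read from the channel belongs to this cm_id.

#include <rdma/rdma_cma.h>

static int resolve_with_retry(struct rdma_cm_id *id, struct sockaddr *dst,
			      int retries)
{
	struct rdma_cm_event *event;
	int ret;

	while (retries-- > 0) {
		ret = rdma_resolve_addr(id, NULL, dst, 4000 /* ms */);
		if (ret)
			return ret;
		if (rdma_get_cm_event(id->channel, &event))
			return -1;
		if (event->event == RDMA_CM_EVENT_ADDR_ERROR) {
			/* ARP request/reply likely lost on the IPoIB
			 * broadcast channel; try again. */
			rdma_ack_cm_event(event);
			continue;
		}
		rdma_ack_cm_event(event);

		ret = rdma_resolve_route(id, 4000 /* ms */);
		if (ret)
			return ret;
		if (rdma_get_cm_event(id->channel, &event))
			return -1;
		if (event->event == RDMA_CM_EVENT_ROUTE_ERROR) {
			/* SA path query timed out; back off and retry
			 * rather than failing the connection outright. */
			rdma_ack_cm_event(event);
			continue;
		}
		rdma_ack_cm_event(event);
		return 0;	/* address and route resolved */
	}
	return -1;		/* gave up after 'retries' attempts */
}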