We were able to get some more test time on the cluster. Our latest findings are below.

> The main issue that we saw was that the SA simply doesn't scale.

From what we could see, it didn't appear that _any_ path record queries were ever lost, even when scaling up to 500,000+ requests. As long as the query timeouts were large enough (dependent on process count), our tests would finish within a reasonable time, and without retrying queries. If the timeout values were too small, the SA would form a backlog of timed out requests. With 1024 processes trying to establish all to all connections, it would take about 30 seconds for all nodes to complete path record queries. The SA was able to sustain about 17,000 queries per second.

>> Was the issue with address resolution being ARP request or reply
>> messages getting lost?

We only just started looking into this when we were bumped off the cluster. In our initial peek at this, it looked like either the ARP requests or replies were being discarded on transmit. Simply increasing the ARP cache timeout fixed most of the problems for us.

> The disconnect delay occurred because of remote nodes being slow to respond
> to disconnect requests. We're still investigating this issue.

This was a DAPL issue.

- Sean
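
As a rough sanity check on the path record figures above, the arithmetic can be sketched in a few lines of Python. This is only a back-of-the-envelope estimate: it assumes one path record query per unordered pair of processes, which the post does not state, and the function names are made up for illustration.

```python
# Back-of-the-envelope check of the path record query numbers quoted above.
# ASSUMPTION (not stated in the post): all-to-all connection setup issues
# one path record query per unordered pair of processes.

def all_to_all_pairs(processes: int) -> int:
    """Number of unordered process pairs in an all-to-all connection setup."""
    return processes * (processes - 1) // 2

def drain_seconds(queries: int, sa_rate_per_sec: float) -> float:
    """Time for the SA to answer `queries` at a sustained rate."""
    return queries / sa_rate_per_sec

processes = 1024      # process count from the test run
sa_rate = 17_000      # sustained SA queries per second, as reported

queries = all_to_all_pairs(processes)
seconds = drain_seconds(queries, sa_rate)

print(f"{queries} queries, about {seconds:.1f} s at {sa_rate} queries/s")
```

Under that assumption the numbers line up with the reported "500,000+ requests" and "about 30 seconds". It would also suggest why too-small timeouts produce a backlog: a query near the end of the queue may wait on the order of the full drain time before the SA reaches it.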
