> This is encouraging. I did try testing with 10,000 ms timeouts and still
> got the failure with only 800 different processes, so I jumped to the
> conclusion that the queries were being dropped. Do you have a guess as to
> a timeout value that would always succeed?
We ended up around a 60 second timeout, based on the number of connections and how quickly our SM node could process queries. This was done a while ago, and there have been a lot of improvements to opensm since then. I don't know of an easy way to test the performance of the SM. It's also possible that our test staggered the queries just enough that the SM could keep up receiving them.

> Maybe I should have come up with a better name. By fabric-specific, I meant
> a specific implementation of the fabric, including the capability of the
> subnet manager node. How does somebody writing rdma_cm code come up with a
> number? That particular program might not put much of a load on the SA, but
> could run concurrently with other jobs that do (or don't). It would be nice
> to have a way to set up the retry mechanism so that it would work on any
> system it ran on.

Maybe the SA service could track the SA response time and adjust the timeout accordingly, e.g. guess = 0.2 * (last response) + 0.8 * (last guess). Users could still indicate that the default timeout should be used. Apps could also help by staggering their start times to avoid hitting the SA with hundreds of thousands of queries at once.

- Sean
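P.S. A rough sketch of the smoothed timeout idea, in case it helps. The names
(sa_timeout_ctx, sa_timeout_update, sa_timeout_next) are made up for
illustration; nothing below is existing librdmacm or opensm code.

#include <stdint.h>

/* Running timeout estimate for SA queries, clamped to sane bounds. */
struct sa_timeout_ctx {
	uint64_t guess_ms;	/* current smoothed estimate */
	uint64_t min_ms;	/* floor, e.g. 100 ms */
	uint64_t max_ms;	/* ceiling, e.g. 60000 ms */
};

/*
 * Fold one observed SA response time into the estimate:
 *   guess = 0.2 * (last response) + 0.8 * (last guess)
 * Integer math keeps this usable in kernel-style code.
 */
static void sa_timeout_update(struct sa_timeout_ctx *ctx, uint64_t response_ms)
{
	uint64_t guess = (2 * response_ms + 8 * ctx->guess_ms) / 10;

	if (guess < ctx->min_ms)
		guess = ctx->min_ms;
	else if (guess > ctx->max_ms)
		guess = ctx->max_ms;
	ctx->guess_ms = guess;
}

/*
 * Timeout handed to the next query: leave some headroom over the estimate
 * so a single slow response isn't treated as a dropped query.
 */
static uint64_t sa_timeout_next(const struct sa_timeout_ctx *ctx)
{
	return 2 * ctx->guess_ms;
}

A consumer could call sa_timeout_update() after each completed query and use
sa_timeout_next() for the retry timer on the next one; apps that want the
default behavior would just never enable it.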