Michael S. Tsirkin wrote:
> It's a problem, I agree, but hard-coding timeouts still does not make sense
> to me - I honestly don't see how will an application know which value to
> use here, since the roundtrip really depends on the topology.
> Any ideas on how this can be handled correctly? Does CMA at least back off
> exponentially on timeout?

From our experience with clusters on the order of 1K nodes, we did not have issues with CM traffic. However, the CM traffic there was not NxN but rather NxM, where N was (say) 1K and M was (say) 16. The application was a cluster file system: Lustre over VIBNAL, which is the Lustre IB layer for the Voltaire gen1 stack.

As for NxN CM/CMA consumers, I recall it being mentioned on this list that CM timeouts/retries had to be changed to get (say) N=128 nodes (ranks?) operating fine with Intel MPI using uDAPL.

Sean - have you been in the loop of analyzing/debugging at this site? Can you confirm that **this** was the issue that broke the setup, and that it started working once you enlarged/changed things (what, and from which value to which value)?

Without a relevant use case in which the current settings actually fail, I don't think there's a need to spend energy on code to generate the correct timeouts/retries for this or that setting.

Or.
