Eitan Zahavi wrote:
[EZ] Having N^2 messages is not a big problem if they do not all go to one
target...
CM is distributed and this is good. Today only the PathRecord part of
connection establishment goes to one node (the SA), and you are
about to fix it...
I expect that we'll start having issues scaling when the number of nodes starts
to exceed the size of the CM's QP. Your idea below should help.
During initial connection setup you will not have anything in the SA
cache and thus the SA will need to answer N^2 PathRecords. Smart
exponential back-off can resolve that DoS attack on the SA at bring-up.
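For illustration, the back-off suggested above could look roughly like the
sketch below. It is only a generic exponential back-off with random jitter, not
code from any existing stack; the function name, the base delay, the cap, and
the retry count are all assumptions made up for the example.

```python
import random

def backoff_delays(base=0.1, cap=10.0, retries=6, rng=None):
    """Yield one randomized delay (in seconds) per retry attempt.

    Hypothetical helper: base, cap and retries are illustrative values.
    """
    rng = rng or random.Random()
    for attempt in range(retries):
        # The ceiling doubles on every attempt until it reaches the cap.
        ceiling = min(cap, base * (2 ** attempt))
        # Pick uniformly in [0, ceiling] so that nodes which failed at the
        # same moment do not all retry at the same moment.
        yield rng.uniform(0.0, ceiling)
```

A node would sleep for each yielded delay before resending its PathRecord
query, so the retries from many nodes spread out over time instead of hitting
the SA together.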
I'll post the code for the cache once I complete my testing, but it issues a
single query to fill the cache. The SA will only see O(n) requests. The cache
also supports an update delay, or settle time, and minimum update time to
prevent spamming the SA with back to back requests.
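To make the settle time and minimum update time concrete, here is a small
sketch of how the two timers could interact. This is not the cache code
mentioned above; the class and method names are hypothetical, and time is
passed in explicitly so the logic is easy to follow.

```python
class RefreshThrottle:
    """Decide when a cache refresh query may be sent (hypothetical helper)."""

    def __init__(self, settle_time, min_update_time):
        self.settle_time = settle_time          # quiet period after the last change event
        self.min_update_time = min_update_time  # minimum gap between two queries
        self.last_query = None                  # time the last query was sent
        self.pending_since = None               # time of the latest unhandled event

    def notify_change(self, now):
        # Every new event restarts the settle timer.
        self.pending_since = now

    def next_query_time(self):
        """Return the earliest time a query may be sent, or None if idle."""
        if self.pending_since is None:
            return None
        due = self.pending_since + self.settle_time
        if self.last_query is not None:
            # Never query more often than min_update_time allows.
            due = max(due, self.last_query + self.min_update_time)
        return due

    def poll(self, now):
        """Return True if a refresh query should be sent at time `now`."""
        due = self.next_query_time()
        if due is None or now < due:
            return False
        self.last_query = now
        self.pending_since = None
        return True
```

With settle_time=2 and min_update_time=10, a burst of events at t=0 and t=1
produces a single query at t=3, and an event at t=4 is held back until t=13.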
[EZ] We might need a little more in the key for QoS support (to come).
This would need to be exposed through our APIs as well. Alternate paths are
also not yet supported.
[EZ] I would try and make sure the connections are not done in a manner
such that all nodes try to establish connections to a single node at the
same time. This is an application issue but can easily be resolved.
I agree.
[EZ] I think a centralized CM is only going to make things worse.
It can reduce the number of messages on the network from O(n^2) to O(n). The
idea is that instead of all nodes sending connection requests to all other
nodes, they send a single connection request -- containing an array of QP
information -- to one node. (The array could be sent over an established
connection, rather than in MADs.) The amount of traffic to that one node should
be only slightly worse than the all to all case.
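A rough count of the request messages in the two schemes, as a sketch. It
assumes one connection request per pair of nodes in the all-to-all case, that
the central node is itself one of the n nodes, and it ignores replies and
retries. The function names are made up for the example.

```python
def all_to_all_requests(n):
    # One connection request per unordered pair of nodes.
    return n * (n - 1) // 2

def centralized_requests(n):
    # Every node except the central one sends a single request that
    # carries its whole array of QP information.
    return n - 1

def requests_seen_by_one_node(n):
    # In the all-to-all case each node takes part in n - 1 connection
    # requests, the same number the central node receives.
    return n - 1

for n in (16, 128, 1024):
    print(n, all_to_all_requests(n), centralized_requests(n))
```

The central node handles the same number of requests as any node does in the
all-to-all case; what grows is the size of each request, since it carries an
array of QP information instead of a single entry.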
- Sean