Hi Todd,
So you agree we will need to design "replica" buildup scalability features into
the solution (to avoid the bring-up load on the SA)?
Why would a caching system not work here, instead of replicating the data?
The caching concept keeps the SA in the loop: it can invalidate the cache, or
entries can expire through a lifetime policy.
The reason I think a total replica (a distribution of the SA) would eventually
be problematic is that as we approach QoS solutions, some need for path record
use and retirement is going to show up. What if the SM decides to change SL2VL
maps due to a new QoS requirement? We will need a more complicated
"synchronization" or invalidation technique to push that kind of data into the
"replica" SAs.
Eitan
Rimmer, Todd wrote:
From: Eitan Zahavi [mailto:[EMAIL PROTECTED]
Hi Sean, Todd,
Although I like the "replica" idea for its "query" performance boost, I
suspect it will actually not scale for very large networks: each node querying
the entire database would cause O(N^2) load on the SA.
After any change (which happens with higher probability on large networks),
the SA will need to send a Report to each of N targets.
We already have some bad experience with SA query issues on large clusters,
like the one reported by Roland:
"searching for SRP targets using PortInfo capability mask".
Our experience has been the exact opposite.
There is an initial load on the SA to populate the replica, which we have
reduced with various techniques, such as backing off when the SA reports Busy
and randomizing the start time of queries. The boost occurs when a new
application starts, such as an MPI using the SA/CM to establish connections as
per the IBTA spec. A 1000-process MPI job would have each process make 999
queries to the SA at job startup time. This causes a burst of 999,000 sets of
SA queries (most will involve both Node Record and Path Record queries, so it
will really be 2x this amount) BEFORE the MPI job can actually start.
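The burst size is simple arithmetic; a quick sketch of the all-to-all startup cost (the function name and the default of 2 records per peer, for the Node Record plus Path Record pair mentioned above, are illustrative):

```python
def startup_queries(n_processes, records_per_peer=2):
    # Each process queries the SA once per peer at job startup;
    # most lookups involve both a Node Record and a Path Record query.
    return n_processes * (n_processes - 1) * records_per_peer

# 1000 processes x 999 peers = 999,000 query sets; with the 2x factor,
# roughly two million SA queries before the MPI job can start.
print(startup_queries(1000))
```

The quadratic growth is why paying this cost once per fabric (into a replica) rather than once per job startup matters at scale.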
As OpenIB moves forward to implement QoS and other features, MPI will have to
use the SA to get its path records. If you study MVAPICH at present, it merely
exchanges LIDs between nodes and hardcodes all the other QoS parameters (or
uses the same value for all processes via environment variables). In a true
QoS and congestion management environment it will instead have to use the
CM/SA.
We have been using this replica technique quite successfully for 2-3 years now.
Our MPI has used the SA/CM for connection establishment for just as long.
As was pointed out, most fabrics will be quite stable. Hence having a replica
and paying the cost of the SA queries once will be much more efficient than
paying that cost on every application startup.
Todd Rimmer
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general