Sean Hefty wrote:
> The cache
> is updated using an SA GET_TABLE request, which is more efficient than
> sending separate SA GET requests for each path record.
> Your assumption is correct. The implementation will contain copies of
> all path records whose SGID is a local node GID. (Currently it contains
> only a single path record per SGID/DGID, but that will be expanded.)
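A quick back-of-envelope sketch of why GET_TABLE beats per-record GETs, assuming roughly 3 path records per 256-byte MAD (the figure used later in this mail) and one request plus one response MAD per individual GET; the helper names here are mine, not anything from the SA code:

```python
import math

def mads_for_separate_gets(n_paths):
    # One SA GET request plus one GET response per path record.
    return 2 * n_paths

def mads_for_get_table(n_paths, records_per_mad=3):
    # One GET_TABLE request plus an RMPP-segmented response
    # (ACK overhead ignored in this rough count).
    segments = math.ceil(n_paths / records_per_mad)
    return 1 + segments

print(mads_for_separate_gets(1000))  # 2000 MADs on the wire
print(mads_for_get_table(1000))      # 335 MADs on the wire
```

So for a 1000-path table the single GET_TABLE exchange needs roughly a sixth of the MADs that per-record GETs would.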
Taking into account the invalidation window of 15 minutes that you
mentioned in one of the emails, and doing some math, I arrive at the
following:
For a 1k node/port fabric, the SM/SA needs to transmit a table of 1k
paths to each local SA. Since you can embed 3 paths in a MAD, each table
takes at least 350 MADs (330 RMPP segments + 20 ACKs). Since we have 1k
nodes, there are 350K MADs to transmit, and if we assume transmission is
uniform over the 1k-second invalidation window (1000 seconds = 16
minutes 40 seconds), we require the SM to transmit at a constant rate of
350k/1k = 350 MADs/sec, forever. And this is RMPP, so depending on the
RMPP implementation it would run into retransmission of segments or of
the whole payload. Each such table also takes 90 KB (350 * 256 bytes) of
RAM, so the SM needs to allow for up to 90 MB of RAM to hold all those
tables.
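The arithmetic above can be redone explicitly; this sketch uses the same assumptions as the mail (3 path records per 256-byte MAD, ~20 ACK MADs of RMPP overhead per table, a 1000-second window) and lands on essentially the same figures:

```python
import math

nodes = 1000            # ports in the fabric, one local SA each
paths_per_table = 1000  # one path record per remote port
records_per_mad = 3     # assumption from the mail
mad_size = 256          # bytes per MAD
window_sec = 1000       # ~16 min 40 s invalidation window

segments = math.ceil(paths_per_table / records_per_mad)  # 334 RMPP segments
acks = 20                                                # rough ACK overhead
mads_per_table = segments + acks                         # ~354 MADs per table

total_mads = nodes * mads_per_table          # ~354K MADs per window
rate = total_mads / window_sec               # sustained SM transmit rate
ram_per_table = mads_per_table * mad_size    # ~90 KB per table
total_ram = nodes * ram_per_table            # ~90 MB across all tables

print(rate)       # ~354 MADs/sec, forever
print(total_ram)  # ~90,000,000 bytes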
Aren't we creating a monster here??? If this is an SA replica that
should scale from day one, let's call it that and figure out how to get
there.
> I view MPI as one of the primary reasons for having a cache. Waiting
> for a
> failed lookup to create the initial cache would delay the startup time
> for apps wanting all-to-all connection establishment. In this case, we
> also get the side effect that the SA receives GET_TABLE requests from
> every node at roughly the same time.
Talking MPI, here are a few points that seem to me somehow unaddressed
in the all-to-all cache design:
+ neither MVAPICH nor OpenMPI uses path queries
+ OpenMPI opens its connections on demand, that is, only if rank I
attempts to send a message to rank J does I connect to J
+ even MPIs that connect all-to-all in an N-rank job would issue only
n(n-1)/2 path queries, so the aggregate load on the SA is half of what
the all-to-all caching scheme generates
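The factor-of-two claim in the last point can be checked directly: on-demand queries are shared per rank pair, while the all-to-all cache stores a path record in each direction. A minimal sketch, with function names of my own choosing:

```python
def on_demand_queries(n):
    # One path lookup per rank pair; the i->j and j->i
    # connections share the same query.
    return n * (n - 1) // 2

def cached_path_records(n):
    # Every node caches a path record to every other node,
    # so both directions of each pair are counted.
    return n * (n - 1)

n = 1000
print(on_demand_queries(n))    # 499500 queries
print(cached_path_records(n))  # 999000 records, exactly twice as many
```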
Or.
_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general
To unsubscribe, please visit http://openib.org/mailman/listinfo/openib-general