We were able to get some more test time on the cluster.  Our latest findings
are below.

> The main issue that we saw was that the SA simply doesn't scale.

From what we could see, it didn't appear that _any_ path record queries were
ever lost, even when scaling up to 500,000+ requests.  As long as the query
timeouts were large enough (dependent on process count), our tests would finish
within a reasonable time, and without retrying queries.  If the timeout values
were too small, the SA would build up a backlog of timed-out requests.

With 1024 processes trying to establish all-to-all connections, it took about
30 seconds for all nodes to complete their path record queries.  The SA was
able to sustain about 17,000 queries per second.
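
As a rough sanity check on those numbers, here is a small sketch of the
arithmetic.  It assumes one path record query per unordered pair of processes,
which is my guess at how the requests add up and not something we measured
directly; the function names are made up for illustration.

```python
def all_to_all_queries(processes: int) -> int:
    """Unordered process pairs, ASSUMING one path record query per pair."""
    return processes * (processes - 1) // 2


def drain_time_seconds(queries: int, sa_rate_qps: float) -> float:
    """Time for the SA to answer `queries` at a sustained rate (queries/s)."""
    return queries / sa_rate_qps


if __name__ == "__main__":
    n = 1024          # process count from the test run
    rate = 17_000     # sustained SA rate observed, in queries per second
    q = all_to_all_queries(n)
    t = drain_time_seconds(q, rate)
    print(f"{n} processes -> {q} queries, ~{t:.1f} s at {rate} queries/s")
    # prints: 1024 processes -> 523776 queries, ~30.8 s at 17000 queries/s
```

Under that assumption the estimate lines up with the 500,000+ requests and the
roughly 30 seconds we saw.  It also suggests why the timeout has to grow with
process count: if every query is issued at once, the last one may wait close
to the full drain time before the SA gets to it.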

>>Was the issue with address resolution being ARP request or reply 
>>messages getting lost?

We only just started looking into this when we were bumped off the cluster.  In
our initial peek at this, it looked like either the ARP requests or replies
were being discarded on transmit.  Simply increasing the ARP cache timeout
fixed most of the problems for us.

> The disconnect delay occurred because of remote nodes being slow to respond
> to disconnect requests.  We're still investigating this issue.

This was a DAPL issue.

- Sean

_______________________________________________
openib-general mailing list
[email protected]
http://openib.org/mailman/listinfo/openib-general

