Re: [ewg] [PATCH] Handling busy responses from the SA

Hefty, Sean Fri, 04 Jun 2010 16:18:43 -0700

> A common method for handling this sort of thing is to randomize
> the retry timeout. It would be a good idea to randomize all timeouts,
> but the BUSY replies should probably randomize over a longer time
> period.
> 
> Randomization prevents nodes in the cluster from self-synchronizing
> and making the load on the SA worse.
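
As a rough illustration of the randomized retry timeout described in the
quoted text, here is a minimal sketch. The function name, the constants, and
the choice of a uniform distribution are all assumptions made for this
example; none of them come from the patch or from the MAD layer. The only
point it shows is that a BUSY reply draws its extra delay from a wider range
than an ordinary timeout.

```python
import random

# Hypothetical tuning values, chosen only for illustration.
JITTER_FRACTION = 0.25       # ordinary timeouts: up to 25% extra delay
BUSY_JITTER_FRACTION = 1.0   # BUSY replies: up to 100% extra delay


def retry_timeout_ms(base_ms, busy, rng=random):
    """Return a randomized retry timeout in milliseconds.

    base_ms is the nominal timeout; busy selects the wider jitter range
    used after a BUSY reply. rng is any object with a uniform() method.
    """
    fraction = BUSY_JITTER_FRACTION if busy else JITTER_FRACTION
    # Add a uniformly distributed delay in [0, fraction * base_ms] so that
    # nodes that timed out together do not all retry at the same instant.
    return base_ms + rng.uniform(0.0, fraction * base_ms)
```

With a nominal timeout of 1000 ms, this sketch would retry somewhere between
1000 and 1250 ms after an ordinary timeout and between 1000 and 2000 ms after
a BUSY reply.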


I agree that randomization would be nice, but I think we want even more than
that.  Part of the problem we've seen with the current implementation is
that when a large HPC job starts, everyone and their dog sends the SA a query.
These queries time out around the same time and get resent, and the SA ends up
processing a huge number of duplicates.  The MAD layer could be a lot more
intelligent and avoid sending more than a handful (1?) of retries (or even
initial requests) at a time until some complete.
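
To make the idea concrete, here is a minimal sketch of limiting the number of
outstanding queries. It is not the MAD layer's code: the class name, the
send callback, and the default window of one are hypothetical and chosen
only to show the shape of the approach. Queries (initial or retried) beyond
the window wait in a queue and are only sent once an earlier one completes.

```python
from collections import deque


class QueryWindow:
    """Allow at most max_in_flight queries to be outstanding at once."""

    def __init__(self, send, max_in_flight=1):
        self._send = send            # callback that actually transmits a query
        self._max = max_in_flight
        self._in_flight = set()      # queries sent and not yet completed
        self._pending = deque()      # queries waiting for a free slot

    def submit(self, query_id):
        """Send the query now if the window has room, otherwise queue it."""
        if len(self._in_flight) < self._max:
            self._in_flight.add(query_id)
            self._send(query_id)
        else:
            self._pending.append(query_id)

    def complete(self, query_id):
        """Mark a query as finished and send queued queries that now fit."""
        self._in_flight.discard(query_id)
        while self._pending and len(self._in_flight) < self._max:
            next_id = self._pending.popleft()
            self._in_flight.add(next_id)
            self._send(next_id)
```

A caller would route both initial requests and retries through submit() and
call complete() when a response arrives or the query is abandoned, so the SA
never sees more than max_in_flight requests from this node at a time.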

- Sean
_______________________________________________
ewg mailing list
[email protected]
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg
