Mark,

On 2/21/2011 10:21 AM, Mark Thomas wrote:
> The ASF Sonar installation managed to generate 46GB of identical log
> messages [1] today in the 8 hours it took to notice it was down.

Ugh. Yeah, that sucks.

Hitting the ulimit doesn't necessarily mean disaster and the process could still recover. For instance, an out-of-control request might try to open files in a loop and tie up all the fds for a minute or two before the request dies and the GC cleans everything back up. In the meantime, some request processors might suffer but eventually recover.

> While better monitoring would/should have identified the problem sooner,
> it does demonstrate a problem with the acceptor threads in all three
> endpoints. If there is a system-level issue that causes the accept()
> call to always fail (such as hitting the ulimit) then the endpoint
> essentially enters a loop where it logs an error message for every
> iteration of the loop. This will result in many log messages per second.

I like this idea. Does that mean that you'll get once-per-second (or whatever) errors across all processor threads or once-per-second errors for each processor thread?

> + if (errorDelay == 0) {
> + errorDelay = 50;

How about making this initial delay configurable?

> + } else if (errorDelay < 1600) {

How about making this maximum delay configurable?

> + errorDelay = errorDelay * 2;

Geometric growth seems reasonable... I don't see a reason to make this multiplier configurable.

-chris
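
For reference, the delay schedule in the quoted patch fragment, with the two configurable values suggested above, could be sketched roughly as follows. This is only an illustration: the class name `AcceptBackoff`, the constructor parameters `initialDelayMs` and `maxDelayMs`, and the reset-on-success behaviour are assumptions, not part of the quoted patch. The `Math.min` cap is also an addition, so that a maximum that is not a power-of-two multiple of the initial delay is still respected.

```java
// Hypothetical sketch of the doubling error delay from the quoted fragment.
// All names here are illustrative and do not come from the actual patch.
class AcceptBackoff {
    private final int initialDelayMs; // 50 in the quoted fragment
    private final int maxDelayMs;     // 1600 in the quoted fragment
    private int errorDelay = 0;       // 0 means "no recent failure"

    AcceptBackoff(int initialDelayMs, int maxDelayMs) {
        this.initialDelayMs = initialDelayMs;
        this.maxDelayMs = maxDelayMs;
    }

    // Called after a failed accept(); returns how long to wait, in
    // milliseconds, before retrying (and before logging again).
    int onFailure() {
        if (errorDelay == 0) {
            errorDelay = initialDelayMs;
        } else if (errorDelay < maxDelayMs) {
            // Geometric growth with a fixed multiplier of 2, capped at the maximum.
            errorDelay = Math.min(errorDelay * 2, maxDelayMs);
        }
        return errorDelay;
    }

    // Assumption: a successful accept() clears the delay.
    void onSuccess() {
        errorDelay = 0;
    }
}
```

With the values from the fragment (50 and 1600), successive failures give delays of 50, 100, 200, 400, 800, 1600, 1600, ... ms, so a persistently failing accept() settles at well under one log message per second per acceptor thread.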