Mark,

On 2/21/2011 10:21 AM, Mark Thomas wrote:
> The ASF Sonar installation managed to generate 46GB of identical log
> messages [1] today in the 8 hours it took to notice it was down.

Ugh. Yeah, that sucks.

Hitting the ulimit doesn't necessarily mean disaster and the process
could still recover. For instance, an out-of-control request might try
to open files in a loop and tie up all the fds for a minute or two
before the request dies and the GC cleans everything back up. In the
meantime, some request processors might suffer but eventually recover.

> While better monitoring would/should have identified the problem sooner,
> it does demonstrate a problem with the acceptor threads in all three
> endpoints. If there is a system-level issue that causes the accept()
> call to always fail (such as hitting the ulimit) then the endpoint
> essentially enters a loop where it logs an error message for every
> iteration of the loop. This will result in many log messages per second.

I like the idea of backing off between failed accept() calls. Does that
mean you'll get one error per interval (once per second, or whatever)
across all processor threads, or one error per interval for each
processor thread?

> +                        if (errorDelay == 0) {
> +                            errorDelay = 50;

How about making this initial delay configurable?

> +                        } else if (errorDelay < 1600) {

How about making this maximum delay configurable?

> +                            errorDelay = errorDelay * 2;

Geometric growth seems reasonable... I don't see a reason to make this
multiplier configurable.
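
To make the two suggestions above concrete, here is a rough sketch of
the same backoff with the initial and maximum delays passed in. The
class and method names (AcceptBackoff, onError, onSuccess) are made up
for illustration and are not in your patch or in Tomcat. Clamping to the
maximum and resetting to zero after a successful accept() are my
assumptions, not something I can see in the quoted hunk.

```java
// Hypothetical helper for illustration only; not part of the patch or of Tomcat.
final class AcceptBackoff {
    private final int initialDelay; // ms; the patch hard-codes 50
    private final int maxDelay;     // ms; the patch stops doubling at 1600
    private int errorDelay = 0;     // 0 means "no recent accept() failure"

    AcceptBackoff(int initialDelay, int maxDelay) {
        if (initialDelay <= 0 || maxDelay < initialDelay) {
            throw new IllegalArgumentException(
                    "need 0 < initialDelay <= maxDelay");
        }
        this.initialDelay = initialDelay;
        this.maxDelay = maxDelay;
    }

    /** Call after a failed accept(); returns how long to sleep, in ms. */
    int onError() {
        if (errorDelay == 0) {
            errorDelay = initialDelay;
        } else if (errorDelay < maxDelay) {
            // Double the delay, but never exceed the configured maximum.
            // (The clamp is an assumption; the long cast avoids int overflow.)
            errorDelay = (int) Math.min((long) errorDelay * 2, maxDelay);
        }
        return errorDelay;
    }

    /** Call after a successful accept() (assumed reset behavior). */
    void onSuccess() {
        errorDelay = 0;
    }
}
```

With (50, 1600) this gives the same 50, 100, 200, ... 1600 sequence as
the hard-coded version, so the defaults wouldn't change behavior.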

-chris
