On Wed, Oct 24, 2012 at 04:54:57PM +0200, Finn Arne Gangstad wrote:
> On Tue, Oct 23, 2012 at 4:02 PM, Mariusz Gronczewski <[email protected]> 
> wrote:
> > 2012/10/23 Thomas Heil <[email protected]>:
> >> Hi,
> >>
> >> On 23.10.2012 13:55, Finn Arne Gangstad wrote:
> >>>
> >>> Each request is a reasonably simple GET request that typically takes
> >>> 10-20ms to process. This works great until a server needs to GC, then
> >>> the query will hang for a few seconds.
> >> I am not quite sure, but I think you can play with "timeout server",
> >> "option redispatch" and "retries", so that when a GC occurs the request
> >> would be redispatched to the next server in the backend.
> >>
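The redispatch setup suggested above could look roughly like this (server names, addresses and timeout values are invented for illustration, not a recommendation):

```
backend app
    timeout server 2s     # give up on a server that hangs, e.g. in GC
    retries 3             # retry the connection attempt a few times
    option redispatch     # on the last retry, allow another server
    server app1 10.0.0.1:8080 check
    server app2 10.0.0.2:8080 check
```

Note that this only helps while the connection is being established; it does not replay a request that was already sent.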
> > Try using "balance leastconn": if a server slows down or halts because
> > of GC, its queue will quickly grow higher than those of the other
> > servers, and new requests will hit the non-GCing ones. The only
> > disadvantage is that servers which respond faster will on average get
> > more requests, but that can be a good thing: if for any reason (backup,
> > system update, etc.) one of the servers starts answering more slowly,
> > it will automatically receive fewer requests.
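In haproxy configuration terms, the leastconn suggestion above is a single backend directive; a minimal sketch with invented server names:

```
backend app
    balance leastconn     # send new requests to the least-loaded server
    server app1 10.0.0.1:8080 check
    server app2 10.0.0.2:8080 check
```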
> 
> leastconn helps slightly in this particular situation: we'd lose maybe
> 6-7 queries instead of 10 (depending a bit on the load), but we still
> lose queries, and we don't want to lose any queries at all. Any query
> that takes more than a second or two is effectively lost.

Well, then you have a tuning problem. If your GC lasts longer than the
maximum acceptable response time, you need to tune your JVM properly so
that GCs are more frequent and take less time.
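For example, on a current HotSpot JVM the collector and its pause-time goal can be chosen on the command line. The flag values below are only illustrative starting points, not a recommendation:

```
# Pin the heap size to avoid resizing pauses, and ask G1 to target
# short, frequent pauses rather than rare multi-second ones.
java -Xms2g -Xmx2g \
     -XX:+UseG1GC \
     -XX:MaxGCPauseMillis=50 \
     -jar app.jar
```

The right settings depend entirely on the application's allocation profile, so they need to be validated against GC logs under real load.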

> haproxy doesn't currently support resubmitting a query, but it would be
> very nice if it could do something along the lines of the nginx feature
> "proxy_next_upstream". nginx lets you resubmit a query until you have
> started sending data back to the client; haproxy only lets you resubmit
> until a connection to the backend server has been established.

No, believe me, this must *absolutely not* be done. HTTP provides no way to
abort a request that was started, nor to know whether a request has been
completed. Doing so is explicitly forbidden in the HTTP spec for a good
reason. What you describe caused a coworker to receive two books he ordered
online (and obviously he paid twice). Only the client is allowed to decide
whether or not to replay a non-idempotent request.
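The double-order failure mode can be sketched in a few lines (a hypothetical simulation with invented names, not haproxy or nginx code): the backend completes its side effect, the response is lost, and a proxy that replays the request repeats the side effect.

```python
orders_placed = 0

def place_order():
    """Non-idempotent backend operation: each call creates a new order."""
    global orders_placed
    orders_placed += 1
    # The order is committed, but the response never reaches the proxy.
    raise TimeoutError("response lost after the order was committed")

def proxy_with_replay(request, attempts=2):
    """A (bad) proxy that resubmits the request when no response arrives."""
    for _ in range(attempts):
        try:
            return request()
        except TimeoutError:
            continue  # replay: the side effect may already have happened
    return None

proxy_with_replay(place_order)
print(orders_placed)  # prints 2: one logical order was placed twice
```

Only the client can know whether repeating the request is safe, which is exactly why the proxy must not make that decision.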

Also, I know of a very large web site which was using the weblogic apache
plugin for load balancing. It was doing exactly what you describe here. The
end result is that when things start to go wrong, you get a domino effect,
because the killer request is sent to every server in turn. Very commonly,
the bottleneck isn't the frontend web servers but the backend database.
By replaying a request, you multiply the load on that database accordingly,
and you redistribute the extra response time to all the servers, which may
at some point all experience excessive response times.

So really, you need to tune the GC. Pausing several seconds is not acceptable
in my opinion. I work with people who use a lot of Java applications, and
I've seen them spend as much time on tuning the JVM as they spend writing
the code, and the result is really worth it. In your case, maybe a 50 ms
pause every 10 s would go unnoticed, for example.

Regards,
Willy

