Hi Finn Arne,

On Sat, Oct 27, 2012 at 09:27:55PM +0200, Finn Arne Gangstad wrote:
> > Well, here you have a tuning problem then. If your GC lasts longer than the
> > maximum critical response time, then you need to correctly tune your JVM to
> > ensure GCs are more common and take less time.
> 
> Tuning the GC to give very small pauses is a possibility, but not the only
> valid solution. We have found that using a parallel stop-the-world GC will
> increase the overall throughput by about 30% (e.g. -XX:+UseNUMA
> -XX:+UseParallelGC), but this gives unpredictable multi-second GC pauses.
> As long as we distribute the load over 5+ servers, there are always enough
> servers that are responsive, and we can reduce the number of servers
> by roughly 25%.

Well, you have enough servers that are responsive, but the non-responsive ones
block random requests, so in the end you randomly degrade the quality of service.

> >> haproxy doesn't currently support resubmitting a query, but it would be very
> >> nice if it could do something along the lines of the nginx feature
> >> "proxy_next_upstream". nginx lets you resubmit a query until you have started
> >> sending data back to the client; haproxy only lets you resubmit until a
> >> connection to the backend server has been established.
> >
> > No, believe me, this must *absolutely not* be done. HTTP provides no way to
> > abort a request that was started, nor to know whether a request has been
> > completed. Doing so is explicitly forbidden in the HTTP spec for a good
> > reason. What you describe caused a coworker to receive two books he ordered
> > online (and obviously he paid twice). Only the client is allowed to decide
> > whether or not to replay a non-idempotent request.
> 
> As a general rule you are correct of course, but we have our services split
> into many different categories. For some of them request duplication is not
> so good, but most of them are idempotent.
> 
> We also use haproxy a lot between internal applications for providing robust
> load distribution and failover. In that case we are in control of both clients
> and servers, and we don't have to worry too much about HTTP guarantees.

Then if you control the client, it's safe and easy to have the client retry
the request. That is, by the way, what the HTTP standard mandates.

> It would be good however to get the failover-and-retry functionality done right
> in one place, and haproxy could be such a place.

Really, I don't want to enter that game. Most users will have no clue about
the possible consequences of doing this and will think it's fine to do.
And after that you'll start seeing messages such as "haproxy kills servers
in a domino effect" or "haproxy makes you pay twice" on forums.

> > So really, you need to tune the GC. Pausing several seconds is not acceptable
> > in my opinion. I work with people who use a lot of Java applications, and
> > I've seen them spend as much time on tuning the JVM as they spend writing
> > the code, and the result is really worth it. In your case, maybe a 50ms
> > pause every 10s will remain unnoticed for example.
> 
> We have tried different GC strategies.  I realize we can make the system work
> for some definition of work by reducing throughput and going with a suboptimal
> GC strategy, at the cost of additional servers.

Surely there is an acceptable tradeoff between the lower performance and
making requests time out during extremely long pauses? I expect that during
a 2-second pause, a 3 GHz server, running 6 billion cycles or around 20
billion instructions, has more than enough time to put things in a clean
state. 2 seconds for a GC is the time it takes some systems to boot!
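As a starting point for such a tradeoff, HotSpot's concurrent collector trades some throughput for shorter pauses (a sketch only; the flags are real HotSpot options but the values here are made up and must be tuned against your actual heap and workload):

```
# Hypothetical tuning example for a HotSpot JVM: concurrent collection
# starts early enough to avoid a full stop-the-world fallback, and a
# fixed heap size avoids resizing pauses.
java -Xms4g -Xmx4g \
     -XX:+UseConcMarkSweepGC \
     -XX:CMSInitiatingOccupancyFraction=70 \
     -jar app.jar
```

The point is not these particular values, but that a tuned concurrent collector usually brings pauses well under request-timeout territory at a modest throughput cost.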

> Would you strongly object to a patch that added a feature along these lines:
> If a server goes silent for X ms, immediately send a monitoring query. If the
> monitoring query is not handled within Y ms, flag the server as "stopped",
> redispatch all pending requests on the server. "stopped" flag is removed
> as soon as the server responds to any pending request or a monitoring
> query.

This feature already exists (check "observe l7" and "on-error"). However,
the only requests that are redispatched are the ones still in the queue;
those already sent to the server are not available anymore.
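For reference, a minimal sketch of that existing mechanism (backend name, address, and thresholds are invented for the example; see the configuration manual for "observe", "error-limit" and "on-error"):

```
backend app
    option redispatch
    # Observe live HTTP traffic on this server; after 10 consecutive
    # errors, mark it down so that requests still queued for it are
    # redispatched to the remaining servers.
    server srv1 192.168.0.10:8080 check observe layer7 error-limit 10 on-error mark-down
```

Requests already in flight on the failed server are gone either way; only the queued ones can be moved.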

Regards,
Willy

