Hi Brendon,

On Wed, Dec 26, 2012 at 03:09:37PM -0500, Brendon Colby wrote:
> Greetings!
> 
> (Apparently GMail IPs are now listed in SORBs, so when I first sent this 
> through the Gmail web interface I got a bounce; I had to use Thunderbird 
> and IMAP.)
> 
> We just replaced our old commercial load balancer (end of life) with 
> haproxy running on an HP DL360p G8 server with a quad-core Xeon E5-2609 
> 2.4GHz processor (soon to be two for redundancy). Our site is mainly 
> media - SWF files of movies and games, mp4 movies, audio, art, etc. We 
> are serving approximately 2,330 req/s right now, for example, at over 
> 1Gbps of outgoing traffic (we average 600Mbps to 1Gbps+ daily). 
> Everything has been running great for a couple of weeks, and now we're 
> just tidying up a few things like logging and these request error counts 
> we've been seeing. We have multiple sub-domains: one hosting css, one 
> hosting images, etc.
> 
> My goals have been to:
> 
> 1. Figure out what's causing the request errors and see if we can tweak 
> something on the server side to stop them
> 2. Reduce the error log output to a manageable level, so we can keep it 
> enabled all the time and monitor it
> 
> We haven't been able to keep full error logging enabled due to the 
> volume of errors being logged (200-300 per second).
> 
> When we first enabled error logging it was flooded with entries like these:
> 
> Dec 21 21:00:01 localhost haproxy[16034]: x.x.x.x:50714 
> [21/Dec/2012:20:59:56.140] http-in http-in/<NOSRV> -1/-1/-1/-1/12343 400 
> 212 - - CR-- 1913/1903/0/0/0 0/0 "<BADREQ>"
> 
> This isn't the exact error but close - a CR error with a seemingly 
> random timeout. "show errors" reported nothing of value.
> 
> Here's what our defaults looked like:
> 
> defaults
>     backlog 10000
>     mode http
>     # This option helps provide better statistics / graphs.
>     option contstats
>     option splice-auto
>     timeout client 30s
>     timeout connect 10s
>     timeout server 30s
> 
> I did some research and it seemed to me that these errors were simply 
> the browser closing connections. For example, when I opened our site 
> with Chrome, after about 12 seconds I would see several CR errors in the 
> logs from our office IP. I figured these must be keep-alive connections 
> (or something like that) that were just timing out on the client side 
> (hence the CR or client side error).
> 
> I put in a "timeout http-request" of five seconds and haproxy then 
> started logging cR errors with a 5000ms timeout value (the same log 
> values above essentially, with cR instead of CR). What this told me was 
> that now instead of the client disconnecting (CR) haproxy was 
> proactively disconnecting (cR) and throwing a 408 error, which made 
> sense. Right now http-request is set to 15 seconds and we're still 
> seeing 100s of errors per second.
> 
> Next, I read through almost the entire haproxy manual (very good docs!) 
> and found this section under "timeout http-keep-alive": "There is also 
> another difference between the two timeouts : when a connection expires 
> during timeout http-keep-alive, no error is returned, the 
> connection just closes. If the connection expires in "http-request" 
> while waiting for a connection to complete, a HTTP 408 error is returned."
> 
> I thought for sure using "timeout http-keep-alive 1" would stop the 
> cR/408 errors, but it didn't. "option dontlognull" stops them from being 
> logged, but I see we're still getting 200-300 request errors per second 
> on the frontend.
> 
> The docs say not to use "option dontlognull" because it could mask 
> attacks and such. I agree with this and don't want to leave this in. 
> What's puzzling is that these cR/408 errors seem to be coming from 
> regular site traffic and browser behavior, not an attack.
> 
> Unless I'm mistaken, the way we have haproxy configured right now, we 
> shouldn't be seeing these cR/408 errors. I will post the relevant pieces 
> of our configuration below. Please let me know if I'm missing something 
> here, because at this point I'm stuck!

Thank you for the very well detailed analysis. I believe that some browsers
nowadays tend to proactively establish connections to visited websites, just
in case they will need them later. Since Chrome does everything it can to
reduce page load time, it very likely is something it's doing. This could
explain exactly what you're observing : a new connection over which nothing
is transferred that it closed when the user leaves the site (hence the
random delay).

If you can reproduce the behaviour with your browser, I think that dontlognull
will be your only solution and that we'll have to update the doc to indicate
that browsers have adopted such an internet-unfriendly behaviour that it's
better to leave the option on. What I don't like with proactively opened
connections is that they're killing servers with 10-100 times the load they
would have to sustain and that even small sites might experience issues with
this. If you see 200 of them per second and they last 5s on average, it means
you're constantly having 5000 idle connections just because of this. Many
web servers can't handle this :-/

BTW, your config is really clean, I have nothing to suggest. I wouldn't
be surprized if some people reuse it to build their own configs :-)

Best regards,
Willy


Reply via email to