Hi Reinout, On Fri, Dec 28, 2012 at 07:46:42PM +0100, Reinout Verkerk | Trilex wrote: > Hi Jonathan, > > Thank you for your reply, but I thought about that, but I can not find any > evidence that that's the case. If I look at my last log, I restarted > haproxy (after upgrading to the latest dev17) at 18:20LT. The first warning > after that is the replication check failing at 19:37. Loglevel is set to > notice, maybe that has something to do with it? > > Dec 28 18:20:03 localhost haproxy[19206]: Proxy db02_replication started. > Dec 28 19:37:13 localhost haproxy[19207]: Server db02_replication/db02 is > DOWN, reason: Layer7 check passed, code: 200, info: "OK", check duration: > 25ms. 0 active and 0 backup servers left. 0 sessions active, 0 requeued, 0 > remaining in queue.
That's amazingly strange indeed. I suspect that a network error occurs from time to time, is reported to the upper layers after the response appears. For example, if the response is sent without reading the request, sometimes a reset will be emitted. It's possible that haproxy sometimes gets two consecutive recv() calls then : - one containing the 200 OK response - a reset The reset would be translated into an error, but it's surprizing that the contents are correctly decoded. Is it possible to take a network capture of this health check when it fails ? Please use "tcpdump -s0 -npi eth0 -w trace.cap" for this to ensure that packets are not truncated, and then once you get a log, you can filter on the interesting session. This will greatly help understand what condition causes this issue and then we can see how to fix or work around it. Regards, Willy

