On Fri, Mar 12, 2010 at 10:49:07PM +0100, Cyril Bonté wrote:
> On Friday 12 March 2010 20:17:14, Willy Tarreau wrote:
> > Hi Cyril,
> >
> > On Fri, Mar 12, 2010 at 05:03:15PM +0100, Cyril Bonté wrote:
> > > Hi Willy,
> > >
> > > Our monitoring scripts use the unix socket to get haproxy's status.
> > > Sometimes they detect haproxy DOWN when it's not really the case.
> >
> > Do you know how much time it takes to observe it ? I'm currently
>
> It's random but generally I don't wait more than 10 seconds (less than 1500
> loops at home).
OK. Here I saw nothing after one hour, so I stopped it.
> > running it on 1.4.0-3 here. It's been running for the last 10 minutes
> > with 2, then 3 and now 10 concurrent scripts. For the record in case
> > that matters, it's running on socat 1.6.0.0 :
> > # /root/bin/socat -V
> > socat by Gerhard Rieger - see www.dest-unreach.org
> > socat version 1.6.0.0 on Oct 28 2007 21:29:34
>
> At work it should be version 1.6.0.1 (debian lenny package)
> Tonight, my tests are done with the version 1.7.1.2.
OK. I feared you'd have a much older one that no one knows
about :-)
> > running on Linux version #1 Sun Jan 31 00:55:16 CET 2010, release
> > 2.4.37-wt3-fw, machine i686
>
> I don't think this makes big differences but my tests were done with
> 2.6.{18,24,31,33} kernels.
OK so you have two different schedulers there, which means different
access patterns.
> > Not much as most of the rework happened between 1.3 and 1.4. In fact,
> > some part of the work also happened between 1.3.15 and 1.3.16 but it
> > was the low-level I/O which is now common with TCP/HTTP. It would be
> > nice to try with "strace socat" instead of "socat" alone. I wonder
> > if it's just a scheduling issue sometimes causing socat to close its
> > output channel after sending the request and before receiving the
> > response (as we commonly have with netcat).
>
> This might be the case but then it's strange that it doesn't happen with non
> concurrent accesses.
Most likely because of a race condition: with a single process, as soon
as the request is sent, it is received and processed, and the response
is sent. With two concurrent processes, one of them may sometimes have
time to send both the request and the shutdown before the request has
been processed and the response sent.
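For reference, the client-side sequence involved here (write the request,
shut down the sending side, then wait for the response) can be reproduced
without socat. The sketch below is only an illustration under assumptions:
toy_stats_server, query and run_once are hypothetical names, the server is
a stand-in and not haproxy's code, and the reply text is made up. This toy
server tolerates the half-close, so the reply always arrives. An empty read
on the client side would be the symptom of a server that drops the
connection on the shutdown instead of answering.

```python
import socket
import threading


def toy_stats_server(conn):
    """Hypothetical stand-in for a stats socket (NOT haproxy's code).

    Reads the request until the peer half-closes (EOF on read), then
    still sends the answer on the other direction before closing.
    """
    data = b""
    while True:
        chunk = conn.recv(8192)
        if not chunk:          # peer called shutdown(SHUT_WR)
            break
        data += chunk
    if data.strip() == b"show info":
        conn.sendall(b"Name: toy\n")   # made-up reply for illustration
    conn.close()


def query(sock, request):
    """Send the request, half-close, then read the reply until EOF."""
    sock.sendall(request)
    sock.shutdown(socket.SHUT_WR)      # same as shutdown(fd, 1 /* send */)
    reply = b""
    while True:
        chunk = sock.recv(8192)
        if not chunk:
            break
        reply += chunk
    return reply


def run_once():
    # A connected pair of UNIX stream sockets replaces the real socket path.
    client, server = socket.socketpair()
    worker = threading.Thread(target=toy_stats_server, args=(server,))
    worker.start()
    try:
        return query(client, b"show info\n")
    finally:
        client.close()
        worker.join()


if __name__ == "__main__":
    print(run_once())
```

Running run_once() in a loop from several processes is one way to compare
the behaviour of a monitoring script with and without the early shutdown.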
> read(0, "show info\n", 8192) = 10
> write(3, "show info\n", 10) = 10
> select(4, [0 3], [3], [], NULL) = 2 (in [0], out [3])
> read(0, "", 8192) = 0
> shutdown(3, 1 /* send */) = 0
> select(4, [3], [], [], {0, 500000}) = 1 (in [3], left {0, 499998})
> read(3, "", 8192) = 0
OK, thanks a lot, this is what I wanted to see. Now at least I know where
to look :-)
Cheers,
Willy