Hi Jean-Baptiste and Robbie,

I could finally reproduce the issues with the 502 errors.
The only way to get them is to install the backend server
on the same machine as haproxy, in order to minimise latency
so that the system does not have the time to ACK the data...

So, I checked with strace what haproxy received. epoll_wait()
returned EPOLLIN|EPOLLERR|EPOLLHUP. When EPOLLERR is here,
haproxy marks the socket as in error state and stops there.

I tried to remove the test and observed that recv() could
indeed receive the last data block. So I have definitely
removed it, and we'll rely on recv()'s return code to detect
whether the error is real or not. I think this could be
improved so that we immediately report the error if there
is only EPOLLERR and no other flag, but this really adds
no value, so let's keep it simple.

While trying to reproduce the issue, I discovered something
I talked to Jean-Baptiste about the shutdown(SHUT_WR) and
the close(). In theory there is no need for the shutdown
when you're about to close() because close() does it itself.
However, the shutdown() was still forced when option nolinger
was set, so that we ensure the data correctly leave the
system. But I observed that the same problem we're trying
to fight against happened when the recv buffer still had
data during a close() : the system did not bother sending
the last fragment and immediately sent an RST to the remote
system.

So I changed that so that the shutdown(SHUT_WR) is now
always performed. That way, last data segment is pushed
outside just before the RST, and the remote system has
a chance to get it.

A future patch might consist in trying to completely flush
the socket buffer before doing the close(). That way, the
system will not send an RST. The problem is that this can
lead to a simple DoS : a client could continuously send
data which will be consumed by haproxy before doing a
close(). So maybe we should only try to read a big chunk.

Well, TCP is sometimes dirty ... :-/

Anyway, I'm attaching the two patches for the two fixes
described above. Those have already been queued for
1.3.19. Feel free to give them a try and comment on what
you observe.

If you prefer to try the next 1.3.19 snapshot, it will
appear tomorrow in the snapshot directory, or it can also
be generated on the fly by clicking on the last commit's
"snapshot" link in the gitweb interface here :

   http://haproxy.1wt.eu/git?p=haproxy-1.3.git;a=summary

(please don't abuse it though)

Regards,
Willy

>From 77cb0671427453e3bc18ed2232fd7f5da94b6482 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Tue, 14 Jul 2009 19:21:50 +0200
Subject: [BUG] stream_sock: always shutdown(SHUT_WR) before closing

When we close a socket with unread data in the buffer, or when the
nolinger option is set, we regularly lose the last fragment, which
often contains the error message. This typically occurs when sending
too large a request. Only the RST is seen due to the close() (since
not all data were read) and the output message never reaches the
network.

Doing a shutdown() before the close() solves this annoying issue
because the data are really pushed before the system sends the RST.
(cherry picked from commit 720058cdcbd5285dc4e4a48216b10c9b96000141)
---
 src/stream_sock.c |   13 +++++++++----
 1 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/src/stream_sock.c b/src/stream_sock.c
index 82bf8ca..80d4996 100644
--- a/src/stream_sock.c
+++ b/src/stream_sock.c
@@ -787,11 +787,16 @@ void stream_sock_shutw(struct stream_interface *si)
 
        switch (si->state) {
        case SI_ST_EST:
-               if (!(si->ib->flags & BF_SHUTR)) {
-                       EV_FD_CLR(si->fd, DIR_WR);
-                       shutdown(si->fd, SHUT_WR);
+               /* we have to shut before closing, otherwise some short messages
+                * may never leave the system, especially when there are 
remaining
+                * unread data in the socket input buffer, or when nolinger is 
set.
+                */
+               EV_FD_CLR(si->fd, DIR_WR);
+               shutdown(si->fd, SHUT_WR);
+
+               if (!(si->ib->flags & BF_SHUTR))
                        return;
-               }
+
                /* fall through */
        case SI_ST_CON:
                /* we may have to close a pending connection, and mark the
-- 
1.6.0.3

>From 53c802f682d6634663badd0999f932c361ee66f6 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Tue, 14 Jul 2009 19:55:05 +0200
Subject: [BUG] stream_sock: don't stop reading when the poller reports an error

As reported by Jean-Baptiste Quenot and Robbie Aelter, sometimes a
backend server error is converted to a 502 error if the backend stops
before reading all the request. The reason is that the remote system
sends a TCP RST packet because there are still unread data pending in
the socket buffer. This RST is translated as a socket error on the
local system, and this error is reported by the poller.

However, most of the time, it's a write error, but the system is
still able to read the remaining pending data, such as in the trace
below :

send(7, "GET /aaa HTTP/1.0\r\nUser-Agent: Mo"..., 1123, 
MSG_DONTWAIT|MSG_NOSIGNAL) = 1123
epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0
epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}}, 8, 1000) = 1
gettimeofday({1247593958, 643572}, NULL) = 0
recv(7, "HTTP/1.0 400 Bad request\r\nCache-C"..., 7000, MSG_NOSIGNAL) = 187
setsockopt(6, SOL_TCP, TCP_NODELAY, [0], 4) = 0
setsockopt(6, SOL_TCP, TCP_CORK, [1], 4) = 0
send(6, "HTTP/1.0 400 Bad request\r\nCache-C"..., 187, 
MSG_DONTWAIT|MSG_NOSIGNAL) = 187
shutdown(6, 1 /* send */)               = 0

The recv succeeded while epoll_wait() reported an error.

Note: This case is very hard to reproduce and requires that the backend
server is reached via the loopback in order to minimise latency and
reduce the risk of sent data being ACKed.
(cherry picked from commit 7154365cc60b124b543db4e98faedc75c0f3a2cb)
---
 src/stream_sock.c |   11 ++++++++---
 1 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/src/stream_sock.c b/src/stream_sock.c
index 80d4996..a6603a3 100644
--- a/src/stream_sock.c
+++ b/src/stream_sock.c
@@ -234,8 +234,13 @@ int stream_sock_read(int fd) {
 
        retval = 1;
 
-       /* stop immediately on errors */
-       if (fdtab[fd].state == FD_STERROR || (fdtab[fd].ev & FD_POLL_ERR))
+       /* stop immediately on errors. Note that we DON'T want to stop on
+        * POLL_ERR, as the poller might report a write error while there
+        * are still data available in the recv buffer. This typically
+        * happens when we send too large a request to a backend server
+        * which rejects it before reading it all.
+        */
+       if (fdtab[fd].state == FD_STERROR)
                goto out_error;
 
        /* stop here if we reached the end of data */
@@ -648,7 +653,7 @@ int stream_sock_write(int fd)
 #endif
 
        retval = 1;
-       if (fdtab[fd].state == FD_STERROR || (fdtab[fd].ev & FD_POLL_ERR))
+       if (fdtab[fd].state == FD_STERROR)
                goto out_error;
 
        /* we might have been called just after an asynchronous shutw */
-- 
1.6.0.3

Reply via email to