Hi Jean-Baptiste and Robbie, I could finally reproduce the issues with the 502 errors. The only way to get them is to install the backend server on the same machine as haproxy, in order to minimise latency so that the system does not have the time to ACK the data...
So, I checked with strace what haproxy received. epoll_wait() returned EPOLLIN|EPOLLERR|EPOLLHUP. When EPOLLERR is here, haproxy marks the socket as in error state and stops there. I tried to remove the test and observed that recv() could indeed receive the last data block. So I have definitely removed it, and we'll rely on recv()'s return code to detect whether the error is real or not. I think this could be improved so that we immediately report the error if there is only EPOLLERR and no other flag, but this really adds no value, so let's keep it simple. While trying to reproduce the issue, I discovered something I talked to Jean-Baptiste about the shutdown(SHUT_WR) and the close(). In theory there is no need for the shutdown when you're about to close() because close() does it itself. However, the shutdown() was still forced when option nolinger was set, so that we ensure the data correctly leave the system. But I observed that the same problem we're trying to fight against happened when the recv buffer still had data during a close() : the system did not bother sending the last fragment and immediately sent an RST to the remote system. So I changed that so that the shutdown(SHUT_WR) is now always performed. That way, last data segment is pushed outside just before the RST, and the remote system has a chance to get it. A future patch might consist in trying to completely flush the socket buffer before doing the close(). That way, the system will not send an RST. The problem is that this can lead to a simple DoS : a client could continuously send data which will be consumed by haproxy before doing a close(). So maybe we should only try to read a big chunk. Well, TCP is sometimes dirty ... :-/ Anyway, I'm attaching the two patches for the two fixes described above. Those have already been queued for 1.3.19. Feel free to give them a try and comment on what you observe. If you prefer to try the next 1.3.19 snapshot, it will appear tomorrow in the snapshot directory, or it can also be generated on the fly by clicking on the last commit's "snapshot" link in the gitweb interface here : http://haproxy.1wt.eu/git?p=haproxy-1.3.git;a=summary (please don't abuse it though) Regards, Willy
>From 77cb0671427453e3bc18ed2232fd7f5da94b6482 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <[email protected]> Date: Tue, 14 Jul 2009 19:21:50 +0200 Subject: [BUG] stream_sock: always shutdown(SHUT_WR) before closing When we close a socket with unread data in the buffer, or when the nolinger option is set, we regularly lose the last fragment, which often contains the error message. This typically occurs when sending too large a request. Only the RST is seen due to the close() (since not all data were read) and the output message never reaches the network. Doing a shutdown() before the close() solves this annoying issue because the data are really pushed before the system sends the RST. (cherry picked from commit 720058cdcbd5285dc4e4a48216b10c9b96000141) --- src/stream_sock.c | 13 +++++++++---- 1 files changed, 9 insertions(+), 4 deletions(-) diff --git a/src/stream_sock.c b/src/stream_sock.c index 82bf8ca..80d4996 100644 --- a/src/stream_sock.c +++ b/src/stream_sock.c @@ -787,11 +787,16 @@ void stream_sock_shutw(struct stream_interface *si) switch (si->state) { case SI_ST_EST: - if (!(si->ib->flags & BF_SHUTR)) { - EV_FD_CLR(si->fd, DIR_WR); - shutdown(si->fd, SHUT_WR); + /* we have to shut before closing, otherwise some short messages + * may never leave the system, especially when there are remaining + * unread data in the socket input buffer, or when nolinger is set. + */ + EV_FD_CLR(si->fd, DIR_WR); + shutdown(si->fd, SHUT_WR); + + if (!(si->ib->flags & BF_SHUTR)) return; - } + /* fall through */ case SI_ST_CON: /* we may have to close a pending connection, and mark the -- 1.6.0.3
>From 53c802f682d6634663badd0999f932c361ee66f6 Mon Sep 17 00:00:00 2001 From: Willy Tarreau <[email protected]> Date: Tue, 14 Jul 2009 19:55:05 +0200 Subject: [BUG] stream_sock: don't stop reading when the poller reports an error As reported by Jean-Baptiste Quenot and Robbie Aelter, sometimes a backend server error is converted to a 502 error if the backend stops before reading all the request. The reason is that the remote system sends a TCP RST packet because there are still unread data pending in the socket buffer. This RST is translated as a socket error on the local system, and this error is reported by the poller. However, most of the time, it's a write error, but the system is still able to read the remaining pending data, such as in the trace below : send(7, "GET /aaa HTTP/1.0\r\nUser-Agent: Mo"..., 1123, MSG_DONTWAIT|MSG_NOSIGNAL) = 1123 epoll_ctl(3, EPOLL_CTL_ADD, 7, {EPOLLIN, {u32=7, u64=7}}) = 0 epoll_wait(3, {{EPOLLIN|EPOLLERR|EPOLLHUP, {u32=7, u64=7}}}, 8, 1000) = 1 gettimeofday({1247593958, 643572}, NULL) = 0 recv(7, "HTTP/1.0 400 Bad request\r\nCache-C"..., 7000, MSG_NOSIGNAL) = 187 setsockopt(6, SOL_TCP, TCP_NODELAY, [0], 4) = 0 setsockopt(6, SOL_TCP, TCP_CORK, [1], 4) = 0 send(6, "HTTP/1.0 400 Bad request\r\nCache-C"..., 187, MSG_DONTWAIT|MSG_NOSIGNAL) = 187 shutdown(6, 1 /* send */) = 0 The recv succeeded while epoll_wait() reported an error. Note: This case is very hard to reproduce and requires that the backend server is reached via the loopback in order to minimise latency and reduce the risk of sent data being ACKed. (cherry picked from commit 7154365cc60b124b543db4e98faedc75c0f3a2cb) --- src/stream_sock.c | 11 ++++++++--- 1 files changed, 8 insertions(+), 3 deletions(-) diff --git a/src/stream_sock.c b/src/stream_sock.c index 80d4996..a6603a3 100644 --- a/src/stream_sock.c +++ b/src/stream_sock.c @@ -234,8 +234,13 @@ int stream_sock_read(int fd) { retval = 1; - /* stop immediately on errors */ - if (fdtab[fd].state == FD_STERROR || (fdtab[fd].ev & FD_POLL_ERR)) + /* stop immediately on errors. Note that we DON'T want to stop on + * POLL_ERR, as the poller might report a write error while there + * are still data available in the recv buffer. This typically + * happens when we send too large a request to a backend server + * which rejects it before reading it all. + */ + if (fdtab[fd].state == FD_STERROR) goto out_error; /* stop here if we reached the end of data */ @@ -648,7 +653,7 @@ int stream_sock_write(int fd) #endif retval = 1; - if (fdtab[fd].state == FD_STERROR || (fdtab[fd].ev & FD_POLL_ERR)) + if (fdtab[fd].state == FD_STERROR) goto out_error; /* we might have been called just after an asynchronous shutw */ -- 1.6.0.3

