Re: Testers wanted : about the stalled POST issues = FOUND
Hi all, On Fri, Dec 14, 2012 at 04:41:44PM +0100, Willy Tarreau wrote: it seems there have been a few reports of stalled POST requests recently, but at this point in time we still have very few information and it's hard to draw a verdict. Just to keep you updated, thanks to the amazing work Sander Klein and John Rood have done at Picturae ICT ( http://picturae.com/ ) we could finally spot the bug after one week of restless digging ! The bug was amazingly hard to reproduce in general and would only affect POST requests under certain circumstances that I never could reproduce despite many efforts. It is likely that other users were affected too but did not notice it because end users did not complain (I'm thinking about webmail and file sharing environments for example). So snapshot 20121220 should be OK. If no more issue gets reported, I'll issue dev16 so that everyone can run something a bit more trustable in their environments. Real kudos to the guys above for their exceptional help! We owe them a big thanks! Willy
Re: Testers wanted : about the stalled POST issues
Hi Sander, On Fri, Dec 14, 2012 at 09:04:39PM +0100, Sander Klein wrote: Hi Willy. On 14.12.2012 16:41, Willy Tarreau wrote: Hi, it seems there have been a few reports of stalled POST requests recently, but at this point in time we still have very few information and it's hard to draw a verdict. After a long code review, I suspect one recent fix for the CPU spinning issues to cause this, though I don't know if the stall issue predates this fix (2012/12/10). If some people experience this issue (large POST requests occasionally stall after a few megabytes for some clients), I would be very happy if they could test the following patch on top of 1.5-dev15 (or the latest snapshot). I suspect that the original fix is not totally correct and may have caused this issue, though it's hard to be certain. The bug is somehow very hard to trigger. But, I did manage to trigger the bug with dev15 a couple of times and I have not been able to trigger it with dev15-and-your-patch. So I think your patch fixes the issue. Thank you very much for testing ! Then I will merge it, at least because it looks more correct to me than the original code, and we'll continue to observe if new bugs are reported. Have a nice week-end, Willy
Re: Testers wanted : about the stalled POST issues
Hi Willy, On 15.12.2012 09:14, Willy Tarreau wrote: The bug is somehow very hard to trigger. But, I did manage to trigger the bug with dev15 a couple of times and I have not been able to trigger it with dev15-and-your-patch. So I think your patch fixes the issue. Thank you very much for testing ! Then I will merge it, at least because it looks more correct to me than the original code, and we'll continue to observe if new bugs are reported. We've been testing a lot more this morning and we still experience POST's that are stalling. But, much less than we had with standard dev15 without your patch. I'm not sure if it is fixed now or we are seeing a second issue either with haproxy or with something else. Greets, Sander
Re: Testers wanted : about the stalled POST issues
Hi Sander, On Sat, Dec 15, 2012 at 10:59:04AM +0100, Sander Klein wrote: We've been testing a lot more this morning and we still experience POST's that are stalling. But, much less than we had with standard dev15 without your patch. I'm not sure if it is fixed now or we are seeing a second issue either with haproxy or with something else. OK, so that means that the issue might still exist in other circumstances. Next time you observe a frozen connection (if you can at all), could you please issue the following command on the socket stats : $ echo show sess all | socat /var/run/haproxy.stat stdio It will really help a lot because it will tell us what state the session is stuck in. I'm auditing all the I/O code right now to find where such weaknesses can appear. It's possible that in the end I'll try to simplify it a it, because code resulting from a transformation generally is very sensible :-/ Thanks for your useful feedback! Willy
Re: Testers wanted : about the stalled POST issues
On Sat, Dec 15, 2012 at 10:59:04AM +0100, Sander Klein wrote: We've been testing a lot more this morning and we still experience POST's that are stalling. But, much less than we had with standard dev15 without your patch. I'm not sure if it is fixed now or we are seeing a second issue either with haproxy or with something else. Could you please try the attached one ? I already have it in my queue as a cleanup patch but maybe it's more than a cleanup. I can imagine a sequence I'm not certain about, which might produce your issue with a very low probability since it would be very timing-dependant. In short, the fd is polled, then switches to speculative I/O in one direction because it leaves some incomplete work behind, the FD is woken up in the other direction, has the POLL_OUT flag set, is not processed because it's marked speculative. Then the other side forwards data, calls stream_int_chk_snd_conn(), causes send() to emit all data at once, which disables the speculative I/O. Next round of forwarding has the polling disabled, but POLL_OUT still set so chk_snd_conn() does not do anything and leaves the FD polling disabled, causing a stall. I'm not totally certain about the exact timings and sequencing required to achieve this case, but I'm not able to demonstrate that it cannot happen. Regards, Willy diff --git a/src/stream_interface.c b/src/stream_interface.c index 32cd211..5f6d85d 100644 --- a/src/stream_interface.c +++ b/src/stream_interface.c @@ -795,8 +795,7 @@ static void stream_int_chk_snd_conn(struct stream_interface *si) return; if (!ob-pipe /* spliced data wants to be forwarded ASAP */ - (!(si-flags SI_FL_WAIT_DATA) ||/* not waiting for data */ -(fdtab[si-conn-t.sock.fd].ev FD_POLL_OUT))) /* we'll be called anyway */ + !(si-flags SI_FL_WAIT_DATA)) /* not waiting for data */ return; if (!(si-conn-flags (CO_FL_HANDSHAKE|CO_FL_WAIT_L4_CONN|CO_FL_WAIT_L6_CONN))) {
Testers wanted : about the stalled POST issues
Hi, it seems there have been a few reports of stalled POST requests recently, but at this point in time we still have very few information and it's hard to draw a verdict. After a long code review, I suspect one recent fix for the CPU spinning issues to cause this, though I don't know if the stall issue predates this fix (2012/12/10). If some people experience this issue (large POST requests occasionally stall after a few megabytes for some clients), I would be very happy if they could test the following patch on top of 1.5-dev15 (or the latest snapshot). I suspect that the original fix is not totally correct and may have caused this issue, though it's hard to be certain. Thanks in advance, Willy diff --git a/src/stream_interface.c b/src/stream_interface.c index 1507dcc..32cd211 100644 --- a/src/stream_interface.c +++ b/src/stream_interface.c @@ -806,7 +806,8 @@ static void stream_int_chk_snd_conn(struct stream_interface *si) if (si-conn-ctrl) fd_want_send(si-conn-t.sock.fd); si-conn-flags = ~(CO_FL_WAIT_DATA|CO_FL_WAIT_ROOM|CO_FL_WAIT_RD|CO_FL_WAIT_WR); - si-conn-flags |= CO_FL_CURR_WR_ENA; + if (fd_ev_is_set(si-conn-t.sock.fd, DIR_WR)) + si-conn-flags |= CO_FL_CURR_WR_ENA; if (si_conn_send_loop(si-conn) 0) { /* Write error on the file descriptor */ -