Re: Testers wanted : about the stalled POST issues = FOUND

2012-12-19 Thread Willy Tarreau
Hi all,

On Fri, Dec 14, 2012 at 04:41:44PM +0100, Willy Tarreau wrote:
 it seems there have been a few reports of stalled POST requests recently,
 but at this point in time we still have very few information and it's hard
 to draw a verdict.

Just to keep you updated, thanks to the amazing work Sander Klein and John
Rood have done at Picturae ICT ( http://picturae.com/ ) we could finally
spot the bug after one week of restless digging ! The bug was amazingly
hard to reproduce in general and would only affect POST requests under
certain circumstances that I never could reproduce despite many efforts.
It is likely that other users were affected too but did not notice it
because end users did not complain (I'm thinking about webmail and file
sharing environments for example).

So snapshot 20121220 should be OK. If no more issue gets reported, I'll
issue dev16 so that everyone can run something a bit more trustable in
their environments.

Real kudos to the guys above for their exceptional help! We owe them a
big thanks!

Willy




Re: Testers wanted : about the stalled POST issues

2012-12-15 Thread Willy Tarreau
Hi Sander,

On Fri, Dec 14, 2012 at 09:04:39PM +0100, Sander Klein wrote:
 Hi Willy.
 
 On 14.12.2012 16:41, Willy Tarreau wrote:
 Hi,
 
 it seems there have been a few reports of stalled POST requests 
 recently,
 but at this point in time we still have very few information and it's 
 hard
 to draw a verdict.
 
 After a long code review, I suspect one recent fix for the CPU 
 spinning
 issues to cause this, though I don't know if the stall issue predates 
 this
 fix (2012/12/10).
 
 If some people experience this issue (large POST requests 
 occasionally stall
 after a few megabytes for some clients), I would be very happy if 
 they could
 test the following patch on top of 1.5-dev15 (or the latest 
 snapshot). I
 suspect that the original fix is not totally correct and may have 
 caused
 this issue, though it's hard to be certain.
 
 The bug is somehow very hard to trigger. But, I did manage to trigger 
 the bug with dev15 a couple of times and I have not been able to trigger 
 it with dev15-and-your-patch. So I think your patch fixes the issue.

Thank you very much for testing !

Then I will merge it, at least because it looks more correct to me than
the original code, and we'll continue to observe if new bugs are reported.

Have a nice week-end,
Willy




Re: Testers wanted : about the stalled POST issues

2012-12-15 Thread Sander Klein

Hi Willy,


On 15.12.2012 09:14, Willy Tarreau wrote:
The bug is somehow very hard to trigger. But, I did manage to 
trigger
the bug with dev15 a couple of times and I have not been able to 
trigger

it with dev15-and-your-patch. So I think your patch fixes the issue.


Thank you very much for testing !

Then I will merge it, at least because it looks more correct to me 
than
the original code, and we'll continue to observe if new bugs are 
reported.




We've been testing a lot more this morning and we still experience 
POST's that are stalling. But, much less than we had with standard dev15 
without your patch. I'm not sure if it is fixed now or we are seeing a 
second issue either with haproxy or with something else.


Greets,

Sander



Re: Testers wanted : about the stalled POST issues

2012-12-15 Thread Willy Tarreau
Hi Sander,

On Sat, Dec 15, 2012 at 10:59:04AM +0100, Sander Klein wrote:
 We've been testing a lot more this morning and we still experience 
 POST's that are stalling. But, much less than we had with standard dev15 
 without your patch. I'm not sure if it is fixed now or we are seeing a 
 second issue either with haproxy or with something else.

OK, so that means that the issue might still exist in other circumstances.

Next time you observe a frozen connection (if you can at all), could you
please issue the following command on the socket stats :

   $ echo show sess all | socat /var/run/haproxy.stat stdio

It will really help a lot because it will tell us what state the session
is stuck in.

I'm auditing all the I/O code right now to find where such weaknesses can
appear. It's possible that in the end I'll try to simplify it a it, because
code resulting from a transformation generally is very sensible :-/

Thanks for your useful feedback!
Willy




Re: Testers wanted : about the stalled POST issues

2012-12-15 Thread Willy Tarreau
On Sat, Dec 15, 2012 at 10:59:04AM +0100, Sander Klein wrote:
 We've been testing a lot more this morning and we still experience 
 POST's that are stalling. But, much less than we had with standard dev15 
 without your patch. I'm not sure if it is fixed now or we are seeing a 
 second issue either with haproxy or with something else.

Could you please try the attached one ? I already have it in my queue as
a cleanup patch but maybe it's more than a cleanup.

I can imagine a sequence I'm not certain about, which might produce your
issue with a very low probability since it would be very timing-dependant.
In short, the fd is polled, then switches to speculative I/O in one direction
because it leaves some incomplete work behind, the FD is woken up in the
other direction, has the POLL_OUT flag set, is not processed because it's
marked speculative. Then the other side forwards data, calls
stream_int_chk_snd_conn(), causes send() to emit all data at once, which
disables the speculative I/O. Next round of forwarding has the polling
disabled, but POLL_OUT still set so chk_snd_conn() does not do anything
and leaves the FD polling disabled, causing a stall.

I'm not totally certain about the exact timings and sequencing required
to achieve this case, but I'm not able to demonstrate that it cannot
happen.

Regards,
Willy

diff --git a/src/stream_interface.c b/src/stream_interface.c
index 32cd211..5f6d85d 100644
--- a/src/stream_interface.c
+++ b/src/stream_interface.c
@@ -795,8 +795,7 @@ static void stream_int_chk_snd_conn(struct stream_interface 
*si)
return;
 
if (!ob-pipe   /* spliced data wants to be 
forwarded ASAP */
-   (!(si-flags  SI_FL_WAIT_DATA) ||/* not waiting for data */
-(fdtab[si-conn-t.sock.fd].ev  FD_POLL_OUT)))   /* we'll be 
called anyway */
+   !(si-flags  SI_FL_WAIT_DATA))   /* not waiting for data */
return;
 
if (!(si-conn-flags  
(CO_FL_HANDSHAKE|CO_FL_WAIT_L4_CONN|CO_FL_WAIT_L6_CONN))) {


Testers wanted : about the stalled POST issues

2012-12-14 Thread Willy Tarreau
Hi,

it seems there have been a few reports of stalled POST requests recently,
but at this point in time we still have very few information and it's hard
to draw a verdict.

After a long code review, I suspect one recent fix for the CPU spinning
issues to cause this, though I don't know if the stall issue predates this
fix (2012/12/10).

If some people experience this issue (large POST requests occasionally stall
after a few megabytes for some clients), I would be very happy if they could
test the following patch on top of 1.5-dev15 (or the latest snapshot). I
suspect that the original fix is not totally correct and may have caused
this issue, though it's hard to be certain.

Thanks in advance,
Willy


diff --git a/src/stream_interface.c b/src/stream_interface.c
index 1507dcc..32cd211 100644
--- a/src/stream_interface.c
+++ b/src/stream_interface.c
@@ -806,7 +806,8 @@ static void stream_int_chk_snd_conn(struct stream_interface 
*si)
if (si-conn-ctrl)
fd_want_send(si-conn-t.sock.fd);
si-conn-flags = 
~(CO_FL_WAIT_DATA|CO_FL_WAIT_ROOM|CO_FL_WAIT_RD|CO_FL_WAIT_WR);
-   si-conn-flags |= CO_FL_CURR_WR_ENA;
+   if (fd_ev_is_set(si-conn-t.sock.fd, DIR_WR))
+   si-conn-flags |= CO_FL_CURR_WR_ENA;
 
if (si_conn_send_loop(si-conn)  0) {
/* Write error on the file descriptor */
-