Hi Mark,

OK, I was able to reproduce, debug and fix it. It was a tough one, really...
It was more a problem of internal semantics than anything else, so I had
to test several possibilities and study their impact and the corner
cases. In the end we get something that's both fixed and better :-)

The issue was mostly visible with pipelined keep-alive requests, but it
was also happening for tails of transfers smaller than 4kB, and for
delayed closes. These cases are much more common (about 1/4 of the
transfers) but shorter, so they go unnoticed: they just result in a few
extra syscalls per transfer, causing a slight CPU usage increase (here I
was seeing 10% CPU before the fix versus 6% with it).

So I'm sending you 3 patches. The first one is the most important: on
its own it should fix the issue. The second and third ones are less
important parts of the fix, but they should still be applied to shave
off a bit more CPU usage. They can be seen as optimizations, yet on
their own they would have largely addressed the issue as well, except
for a few corner cases.

So please apply them all in sequence to your tree. They were made on the
latest master, but they should apply to plain 1.5-dev19 as well.

Oh and by the way, the bug has been present since 1.5-dev12.

Best regards,
Willy

From 33f323dd4089bb62eedcd1cf7c61a155bef5ebd0 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Thu, 18 Jul 2013 21:49:32 +0200
Subject: BUG/MEDIUM: splicing: fix abnormal CPU usage with splicing

Mark Janssen reported an issue in 1.5-dev19 which was introduced
in 1.5-dev12 by commit 96199b10. From time to time, randomly, the
CPU usage spikes to 100% for seconds to minutes.

A deep analysis of the traces provided shows that it happens when
waiting for the response to a second pipelined HTTP request, or
when trying to handle the received shutdown advertised by epoll()
after the last block of data. Each time, splice() was involved with
data pending in the pipe.

The cause of this was that such events could not be taken into account
by splice() nor by recv() and were left pending:

  - the transfer of the last block of data, optionally with a shutdown,
    was not handled by splice() because of the validation that to_forward
    is higher than MIN_SPLICE_FORWARD;

  - the next recv() call was inhibited because of the test on the presence
    of data in the pipe. This is also what prevented the recv() call
    from handling a response to a pipelined request until the client
    had ACKed the previous response.

No fewer than 4 different methods were tried to fix this, and the
current one was finally chosen. The principle is that if an event is not
caught by splice(), then it MUST be caught by recv(). So we remove the
condition on the pipe's emptiness to perform a recv(), and in order to
prevent recv() from being used in the middle of a transfer, we mark
supposedly full pipes with CO_FL_WAIT_ROOM, which makes sense because
the reason for stopping a splice()-based receive is that the pipe is
supposed to be full.
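
For illustration, here is a tiny self-contained sketch of the intended
hand-off. The flag names mirror the ones used in the hunks below, but the
types, values and helpers are simplified assumptions for the example, not
the actual haproxy code:

  /* sketch only: simplified flags/types, not the real haproxy definitions */
  #define CO_FL_WAIT_ROOM  0x0800   /* pipe supposedly full, stop receiving */
  #define SI_FL_WAIT_ROOM  0x0004   /* stream interface waits for room      */

  struct conn_sk { unsigned flags; };
  struct si_sk   { unsigned flags; };
  struct pipe_sk { int data; };     /* bytes still sitting in the pipe      */

  /* splice-side receive: when it stops because the pipe looks full,
   * it now marks the connection with CO_FL_WAIT_ROOM.
   */
  static void splice_in_stopped(struct conn_sk *conn, struct si_sk *si)
  {
          conn->flags |= CO_FL_WAIT_ROOM;
          si->flags   |= SI_FL_WAIT_ROOM;
  }

  /* recv-side receive: the old code also required pipe->data == 0, which is
   * exactly what left the last block, the shutdown or a pipelined response
   * pending. After the fix, only CO_FL_WAIT_ROOM (plus the error/shutdown
   * flags, omitted here) can stop it.
   */
  static int may_use_recv(const struct conn_sk *conn, const struct pipe_sk *pipe)
  {
          (void)pipe; /* intentionally ignored now */
          return !(conn->flags & CO_FL_WAIT_ROOM);
  }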

The net effect is that we don't wake up and sleep in loops during these
transient states. This happened much more often than expected, sometimes
for a few cycles at the end of transfers, but rarely long enough to be
noticed, unless a client timed out with data pending in the pipe. The
effect on CPU usage is visible even when transferring 1MB objects in a
pipeline, where the CPU usage drops from 10% to 6% on a small machine at
medium bandwidth.

Some further improvements are needed:
  - the last chunk of a splice() transfer is never done using splice due
    to the test on to_forward. This is wrong and should be done with
    splice() if the pipe has not yet been emptied;

  - si_chk_snd() should not be called when the write event is already being
    polled, otherwise we're almost certain to get EAGAIN.

Many thanks to Mark for all the traces he took care to provide; they were
essential for understanding this issue, which was not reproducible
without them.

Only 1.5-dev is affected, no backport is needed.
---
 src/raw_sock.c         |  1 +
 src/stream_interface.c | 17 +++++++++++++----
 2 files changed, 14 insertions(+), 4 deletions(-)

diff --git a/src/raw_sock.c b/src/raw_sock.c
index e030253..b9bb8dc 100644
--- a/src/raw_sock.c
+++ b/src/raw_sock.c
@@ -165,6 +165,7 @@ int raw_sock_to_pipe(struct connection *conn, struct pipe *pipe, unsigned int co
                        /* We've read enough of it for this time, let's stop before
                         * being asked to poll.
                         */
+                       conn->flags |= CO_FL_WAIT_ROOM;
                        break;
                }
        } /* while */
diff --git a/src/stream_interface.c b/src/stream_interface.c
index 8a21d39..90e4044 100644
--- a/src/stream_interface.c
+++ b/src/stream_interface.c
@@ -964,20 +964,29 @@ static void si_conn_recv_cb(struct connection *conn)
                if (conn->flags & CO_FL_ERROR)
                        goto out_error;
 
-               if (conn->flags & CO_FL_WAIT_ROOM) /* most likely the pipe is full */
+               if (conn->flags & CO_FL_WAIT_ROOM) {
+                       /* the pipe is full or we have read enough data that it
+                        * could soon be full. Let's stop before needing to poll.
+                        */
                        si->flags |= SI_FL_WAIT_ROOM;
+                       __conn_data_stop_recv(conn);
+               }
 
                /* splice not possible (anymore), let's go on on standard copy */
        }
 
  abort_splice:
-       /* release the pipe if we can, which is almost always the case */
-       if (chn->pipe && !chn->pipe->data) {
+       if (chn->pipe && unlikely(!chn->pipe->data)) {
                put_pipe(chn->pipe);
                chn->pipe = NULL;
        }
 
-       while (!chn->pipe && !(conn->flags & (CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_DATA_RD_SH | CO_FL_WAIT_RD | CO_FL_WAIT_ROOM | CO_FL_HANDSHAKE))) {
+       /* Important note : if we're called with POLL_IN|POLL_HUP, it means the read polling
+        * was enabled, which implies that the recv buffer was not full. So we have a guarantee
+        * that if such an event is not handled above in splice, it will be handled here by
+        * recv().
+        */
+       while (!(conn->flags & (CO_FL_ERROR | CO_FL_SOCK_RD_SH | CO_FL_DATA_RD_SH | CO_FL_WAIT_RD | CO_FL_WAIT_ROOM | CO_FL_HANDSHAKE))) {
                max = bi_avail(chn);
 
                if (!max) {
-- 
1.7.12.4.dirty

From 29ae704f58e273ba5f85fda2440b74c4341a9842 Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Thu, 18 Jul 2013 22:09:48 +0200
Subject: BUG/MINOR: stream_interface: don't call chk_snd() on polled events

As explained in the previous patch, we incorrectly call chk_snd() when
performing a read even if the write event is already subscribed to
poll(). This is counter-productive because we're almost sure to get
an EAGAIN.
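
As an illustration, here is a minimal sketch of the guard this patch adds.
The flag names are the ones used in the hunk below; the types and values
are simplified assumptions for the example, not the actual haproxy code:

  /* sketch only: simplified flags/types, not the real haproxy definitions */
  #define CO_FL_DATA_WR_ENA  0x0001   /* data layer asked for write events */
  #define CO_FL_CURR_WR_ENA  0x0002   /* write events currently polled     */

  struct conn_sk { unsigned flags; };

  /* If write notifications are already subscribed, the poller will invoke
   * the write handler by itself, so attempting a send from the read path
   * would almost certainly just return EAGAIN.
   */
  static int worth_trying_chk_snd(const struct conn_sk *conn)
  {
          return !(conn->flags & (CO_FL_DATA_WR_ENA | CO_FL_CURR_WR_ENA));
  }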

A quick test shows that this fix halves the number of failed splice()
calls without adding any extra work on other syscalls.

This could have been tagged as an improvement, but since this behaviour
made the analysis of the previous bug more complex, it still qualifies as
a fix.
---
 src/stream_interface.c | 8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/src/stream_interface.c b/src/stream_interface.c
index 90e4044..905612c 100644
--- a/src/stream_interface.c
+++ b/src/stream_interface.c
@@ -807,6 +807,14 @@ static void stream_int_chk_snd_conn(struct stream_interface *si)
            !(si->flags & SI_FL_WAIT_DATA))       /* not waiting for data */
                return;
 
+       if (si->conn->flags & (CO_FL_DATA_WR_ENA|CO_FL_CURR_WR_ENA)) {
+               /* already subscribed to write notifications, will be called
+                * anyway, so let's avoid calling it especially if the reader
+                * is not ready.
+                */
+               return;
+       }
+
        if (!(si->conn->flags & (CO_FL_HANDSHAKE|CO_FL_WAIT_L4_CONN|CO_FL_WAIT_L6_CONN))) {
                /* Before calling the data-level operations, we have to prepare
                 * the polling flags to ensure we properly detect changes.
-- 
1.7.12.4.dirty

From 29fba9ebe848ac1b055909520bcbaeb2179b6b2f Mon Sep 17 00:00:00 2001
From: Willy Tarreau <[email protected]>
Date: Thu, 18 Jul 2013 22:21:54 +0200
Subject: OPTIM: splicing: use splice() for the last block when relevant

Splicing is avoided for small transfers because it's generally cheaper
to perform a couple of recv+send calls than pipe+splice+splice. This
has the consequence that the last chunk of a large transfer may be
transferred using recv+send if it's less than 4 kB. But when the pipe
is already set up, it's better to use splice() to read the remaining data,
since it will get merged with the data already pending in the pipe. This is
what now happens every time the reader is slower than the writer.
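
For reference, here is a condensed sketch of the updated decision. The
names mirror those in the hunk below, while the types and the constant
value are simplified assumptions for the example, not the actual haproxy
code:

  /* sketch only: simplified types/values, not the real haproxy definitions */
  #define MIN_SPLICE_FORWARD  4096    /* below this, recv+send is cheaper   */
  #define CF_KERN_SPLICING    0x1     /* kernel splicing allowed on channel */

  struct chn_sk { void *pipe; unsigned long long to_forward; unsigned flags; };

  static int use_splice(const struct chn_sk *chn)
  {
          /* small transfers normally go through recv+send, but once a pipe
           * already exists we keep using splice() even for the last short
           * block, so it gets merged with the data still held in the pipe.
           */
          return (chn->flags & CF_KERN_SPLICING) &&
                 (chn->pipe || chn->to_forward >= MIN_SPLICE_FORWARD);
  }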

Note that this change alone could have fixed most of the CPU hog bug,
except at the end when only the close was pending.
---
 src/stream_interface.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/src/stream_interface.c b/src/stream_interface.c
index 905612c..33f1d0e 100644
--- a/src/stream_interface.c
+++ b/src/stream_interface.c
@@ -934,7 +934,8 @@ static void si_conn_recv_cb(struct connection *conn)
         * using a buffer.
         */
        if (conn->xprt->rcv_pipe &&
-           chn->to_forward >= MIN_SPLICE_FORWARD && chn->flags & CF_KERN_SPLICING) {
+           (chn->pipe || chn->to_forward >= MIN_SPLICE_FORWARD) &&
+           chn->flags & CF_KERN_SPLICING) {
                if (buffer_not_empty(chn->buf)) {
                        /* We're embarrassed, there are already data pending in
                         * the buffer and we don't want to have them at two
-- 
1.7.12.4.dirty
