Hi Annika,

On Mon, Dec 09, 2013 at 09:30:01AM +0000, Annika Wickert wrote:
> Hi everybody,
> 
> we have a few questions regarding load on our Haproxy 1.5-dev19 cluster.
> We constantly run at a load of 12 - 15, most of it system load. I started
> debugging with strace and constantly see the following messages:
> epoll_ctl(0, EPOLL_CTL_ADD, 1541, {EPOLLIN|0x2000, {u32=1541, u64=1541}}) = 0
> epoll_ctl(0, EPOLL_CTL_ADD, 1032, {EPOLLIN|0x2000, {u32=1032, u64=1032}}) = 0
> epoll_wait(0, {{EPOLLIN|0x2000, {u32=3685, u64=3685}}}, 200, 0) = 1
> splice(0xedb, 0, 0xf09, 0, 0x72b0, 0x3) = -1 EAGAIN (Resource temporarily 
> unavailable)
> recvfrom(5110, 0xaeb50a4, 8192, 0, 0, 0) = -1 EAGAIN (Resource temporarily 
> unavailable)
> 
> On our old cluster I do not see any of the "Resource temporarily unavailable"
> messages at the splicing operation.

These are normal and not errors. It's how the kernel informs userspace that
there is nothing more to read on the socket and that it should poll to be
notified when more data are available. I think the reason you didn't see them
on your previous cluster is an old minor bug in the way haproxy used to enable
splice(): it always started with a first recvfrom() before going on with
splice(), so only recvfrom() was reporting this. But that was suboptimal,
because if the sender writes faster than the reader (which is generally the
case), it could be quite hard to find a window where the buffer is empty so
that splicing can be enabled. The fact that you see it now indicates that
splice triggers much quicker than it used to.

> Could this lead to such a performance impact?

No, quite the opposite in fact. With recent kernels (>= 3.5) and smart
enough NICs, splice() is much more efficient than recv+send for moving data
between two NICs. Also, by avoiding data copies, you avoid polluting the
CPU caches, so it is much better for many reasons.

There are some situations where splice is not interesting. The first is when
you have neither LRO/GRO nor TSO/GSO on your NICs, but that is clearly not
your case given that you're running on servers. The second is when you're
transferring very small objects, where the extra logic in the splice() call
is not offset by the number of bytes saved from copying.

> Has something changed in kernel 3.11.5?

I remember a few changes related to splice, but none for TCP, so that's
irrelevant here. I really think it's just because you upgraded haproxy.

> Are there any things which can be tried out at staging cluster to
> break down this problem?

Not necessarily, but don't worry, EAGAIN is not a problem and is completely
normal. It's just like EINPROGRESS when issuing a connect(): not a true
error.

By the way, if you rely heavily on splice() for large objects, you may make
efficient use of tune.pipesize in the global section. Haproxy tries to
allocate very few pipes to move the data between spliced sockets, so it
doesn't hurt to make them large if you run with large TCP windows. I managed
to reach 40 Gbps with 512kB pipes, which is quite good for such a moderate
value.
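For reference, that would look like this in the configuration (the value
below is just the 512kB example above expressed in bytes; tune it to your
own TCP windows):

```
global
    tune.pipesize 524288   # 512kB pipes used for splicing
```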

Best regards,
Willy

