Hi Bryan,
On Thu, Dec 06, 2012 at 10:10:18AM +0100, Bryan Berry wrote:
> It does stay high, here is a graph of cpu performance over the last 24
> hours, the left-hand side are % of CPU time
> https://docs.google.com/open?id=0BzPvBvLIIq7NV0QtTkliM3Yxenc
OK, so since the graph rarely shows 100%, I think that in practice the CPU
usage is oscillating quickly between 100% and zero, and the graph is
averaging it.
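To illustrate what I mean by averaging, here is a minimal sketch. The function
name average_cpu() and the sample values are made up for the example and do not
come from your monitoring tool. A process that alternates between fully busy
and fully idle within one graph interval shows up as an intermediate value:

```c
#include <stddef.h>

/* Hypothetical helper: mean of n CPU-usage samples (in percent).
 * Returns 0 for an empty window. */
double average_cpu(const double *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}
```

With the samples {100, 0, 100, 0}, this returns 50, so the graph would plot
50% even though the process is either spinning or idle at any given instant.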
> The high cpu usage doesn't appear to correlate to any HTTP 500 status codes
> and I wouldn't expect it to since it seems related to the TCP mode proxying
> of our databases.
At first glance in your trace, all sessions seem correct, so I suspect that
this is related to the TCP checks. Baptiste encountered a similar issue with
another user in TCP mode with raw TCP checks. I'll see if I can reproduce any
such issue and/or find an explanation.
In the meantime, if you're adventurous enough to try disabling checks on the
TCP servers to see whether the problem disappears, that could help.
> just by playing w/ strace, it looks like the following function is being
> called over and over again with a value of 0 for wait_time
>
> status = epoll_wait(0, {}, 26, 0)
>
> Line 133, ev_epoll.c
This is normal; it's the polling loop. The zero wait_time means that there
is one FD that is still active in the cache. Unfortunately, it looks like
this FD is no longer attached to a session.
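As a rough illustration of why the timeout ends up being zero, here is a
simplified sketch of how a poller can pick its wait time. It is not the actual
code from ev_epoll.c; the struct and function names are hypothetical, and the
real logic has more cases:

```c
/* Hypothetical, simplified model of an event loop's state. */
struct loop_state {
    int cached_active_fds; /* FDs already known to be active (the "cache") */
    int next_timer_ms;     /* ms until the next timer expires, -1 if none */
};

/* Timeout to hand to epoll_wait(): 0 means "return immediately",
 * -1 means "block until an event arrives". */
int compute_wait_time(const struct loop_state *s)
{
    /* Pending work in the cache: never sleep, poll and come back at once. */
    if (s->cached_active_fds > 0)
        return 0;
    /* Otherwise sleep until the next timer, or forever if there is none. */
    return s->next_timer_ms;
}
```

If an FD stays in the cache while nothing ever processes it, the first branch
is taken on every iteration, and the loop spins on epoll_wait() with a zero
timeout, which would match what strace shows.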
> hope this helps! thanks again for your assistance
Yes, it helps, thank you very much. I'm now back to trying to understand
what is happening, and I will keep you updated. If you're willing to
volunteer for more intrusive debugging (e.g. with gdb), I might have a few
tests to suggest. But I don't want to impose; I understand that it's a
production platform.
Best regards,
Willy