Hi John-Paul,
On Tue, May 06, 2014 at 11:57:08PM +0200, John-Paul Bader wrote:
> Hey,
>
> I will do more elaborate test runs in the next couple of days.
No problem.
> I will
> create traces with ktrace which is not as nice as strace but at least
> will provide more context. Is there anything in particular you'd be
> interested in like only syscalls?
I tend to think that syscalls should tell us what's happening. Indeed,
FreeBSD and Linux are both modern operating systems and quite close,
so in general, what works on one of them works on the other one without
any difficulty. The only differences here might be:

  - kqueue vs epoll
  - specific return value of a syscall that we don't handle properly
    (e.g. we had a few ENOTCONN vs EAGAIN issues in the past)
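
To illustrate the second point, here is a minimal sketch of the kind of errno classification involved. It is not haproxy code: the names io_verdict and classify_io_errno are hypothetical, and treating ENOTCONN as "retry later" only while a connect() is still pending is an assumption made for the example.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper, not taken from haproxy: tells whether a failed
 * non-blocking recv()/send() should be retried later or treated as a
 * real error on the connection. */
enum io_verdict { IO_RETRY, IO_FATAL };

enum io_verdict classify_io_errno(int err, bool connect_pending)
{
	switch (err) {
	case EAGAIN:
#if defined(EWOULDBLOCK) && (EWOULDBLOCK != EAGAIN)
	case EWOULDBLOCK:
#endif
	case EINTR:
		/* nothing to do right now, wait for the poller */
		return IO_RETRY;
	case ENOTCONN:
		/* Assumption: some systems report ENOTCONN where others
		 * report EAGAIN while the connect() has not completed yet. */
		return connect_pending ? IO_RETRY : IO_FATAL;
	default:
		return IO_FATAL;
	}
}
```

If only EAGAIN were handled, the ENOTCONN case would be taken for a hard error on one OS and for "try again" on the other, which is the sort of difference that syscall traces should reveal.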
> Meanwhile I have built haproxy with debug symbols but in the tests I ran
> today, haproxy did not coredump but only went for the 100% CPU way of
> failing where I had to kill it manually. This happened with httpclose
> and with keep-alive so I'd say the problem is not really related to that.
I'm not surprised. If the OS makes a difference, it's in the lower layers,
so all that close vs keep-alive may do is make the problem happen more
often.
What I'm thinking about is that it's possible that we don't always properly
handle an error on a file descriptor, so we don't remove it properly
from the list of polled FDs, and it might then be returned by the poller
as active when we think it's closed. At this point, anything can happen:

  - loop forever because we get an error when trying to access this fd
    and we can't remove it from the polled fd list;
  - crash when we try to dereference the connection which is attached
    to this fd.
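
The scenario above can be sketched with a toy fd table. This is only an illustration under stated assumptions, not haproxy's real fdtab or poller code: fd_entry, fdtab_sketch and the fd_sketch_* functions are hypothetical names, and the "polled" field stands in for the fd's registration in kqueue or poll.

```c
#include <stddef.h>

#define MAX_FDS 64

/* Hypothetical per-fd state, loosely modelled on the idea of an fd table. */
struct fd_entry {
	void *owner;  /* connection attached to this fd, NULL once closed */
	int   polled; /* non-zero while the fd is registered in the poller */
};

struct fd_entry fdtab_sketch[MAX_FDS];

enum fd_event_result { FD_HANDLED, FD_STALE, FD_INVALID };

void fd_sketch_insert(int fd, void *owner)
{
	fdtab_sketch[fd].owner  = owner;
	fdtab_sketch[fd].polled = 1;
}

/* The suspected bug: the connection is detached after an error, but the
 * fd is left registered in the poller. */
void fd_sketch_close_buggy(int fd)
{
	fdtab_sketch[fd].owner = NULL;
}

/* Called for each fd the poller reports as active. */
enum fd_event_result fd_sketch_dispatch(int fd)
{
	if (fd < 0 || fd >= MAX_FDS)
		return FD_INVALID;

	if (!fdtab_sketch[fd].owner) {
		/* Event on an fd we consider closed: drop it from the
		 * polled set so that it is not reported again forever. */
		fdtab_sketch[fd].polled = 0;
		return FD_STALE;
	}

	/* The real handler would dereference the owner here. */
	return FD_HANDLED;
}
```

In this sketch the FD_STALE branch is where a handler that silently returns would keep getting the same event (the 100% CPU case), and where one that dereferences the owner without checking would crash.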
> It's so sad because before the CPU load suddenly rises, and
> requests/connections aren't handled anymore, haproxy performs so well and
> effortlessly.
>
> Also, if I can help by providing access to a FreeBSD machine, just let
> me know. I have plenty :)
At some point it could be useful, especially if we manage to reproduce
the problem on a test platform.
> If you have any other idea apart from ktrace, coredumps to make
> troubleshooting more effective I'd be more than happy to help.
There's something you can try to see if it's related to what I suspect
above. If you apply this patch and it crashes earlier, it definitely
means that we're having a problem with an fd which is reported after
being closed:
diff --git a/src/connection.c b/src/connection.c
index 1483f18..27bb6c5 100644
--- a/src/connection.c
+++ b/src/connection.c
@@ -44,7 +44,7 @@ int conn_fd_handler(int fd)
 	unsigned int flags;
 
 	if (unlikely(!conn))
-		return 0;
+		abort();
 
 	conn_refresh_polling_flags(conn);
 	flags = conn->flags & ~CO_FL_ERROR; /* ensure to call the wake handler upon error */
If this happens, retry without kqueue (start haproxy with -dk, or put
"nokqueue" in the global section). It will then use poll and the issue
should not appear, otherwise we have a bigger bug.
Regards,
Willy