On Tue, Apr 18, 2017 at 04:19:44PM +0300, Slawa Olhovchenkov wrote:
> Today I am got this issuse on all 3 haproxy different hosts
> simultaneously (about 07:38:57 UTC I am got 'kernel: sonewconn: pcb
> 0xfffff80040f781d0: Listen queue overflow: 1537 already in queue
> awaiting acceptance (564 occurrences)').
OK so you're one of those who got this as well! Cool! (not for you I
know). Till now we only got a few reports from other users who started
to meet the issue after upgrading from FBSD 10 to 11, and all of them
die at the same time. It was 09:38 in our last "prediction" and reported
as such by another person, so I think it was local time probably on UTC+2.
> 2017-04-18 07:38:57 UTC = 1492501137 unix time.
> 1492501137 unix time = 1640 now_ms, i.e. issues cused around now_ms
> wraps.
In fact it's -2147482008 ms, 1640 is if you round it to 31 bits, but
anyway I get your point and we see the relation with the wrapping that
we imagined but still couldn't catch.
> Track down by dtrace show negative timeout.tv_sec/timeout.tv_nsec.
Nice! All truss outputs we got so far were truncated to 3 args so we
could never see the contents of the timeout field.
Now the remaining questions are :
- how is it possible to have such a large timeout in the timer queue
- why doesn't the situation fix by itself on the next millisecond when
it's not supposed anymore to be possible to have such a large timeout ;
- why doesn't it affect other pollers
- why some users experienced a kernel panic quickly after the issue
But at least we're leaning towards something so your report is *very*
valuable, thanks a lot.
> I am don't know how to add loging to ev_kqueue.c for print
> now_ms/exp/other stuff.
Just after assignment of timeout.tv_sec/nsec, you could do something
like this :
if (timeout.tv_sec < 0 || timeout.tv_nsec < 0 || timeout.tv_nsec >=
1000000000)
fprintf(stderr, "sec=%u nsec=%u\n", (unsigned)timeout.tv_sec,
(unsigned)timeout.tv_nsec);
There's also fddebug() to emit this output on fd -1 so that it can be
debugged with strace/truss instead of polluting the stderr.
> # grep time /usr/local/etc/haproxy.conf | sort | uniq
> timeout client 100s
> timeout client 50000
> timeout client 50000ms
> timeout connect 100s
> timeout connect 5000
> timeout connect 500ms
> timeout queue 100s
> timeout server 100s
> timeout server 50000
> timeout server 2000ms
> timeout server 30s
Again nothing in nature of causing such a timeout to occur :-/
Thanks,
Willy