On Tue, Apr 18, 2017 at 04:19:44PM +0300, Slawa Olhovchenkov wrote: > Today I am got this issuse on all 3 haproxy different hosts > simultaneously (about 07:38:57 UTC I am got 'kernel: sonewconn: pcb > 0xfffff80040f781d0: Listen queue overflow: 1537 already in queue > awaiting acceptance (564 occurrences)').
OK so you're one of those who got this as well! Cool! (not for you I know). Till now we only got a few reports from other users who started to meet the issue after upgrading from FBSD 10 to 11, and all of them die at the same time. It was 09:38 in our last "prediction" and reported as such by another person, so I think it was local time probably on UTC+2. > 2017-04-18 07:38:57 UTC = 1492501137 unix time. > 1492501137 unix time = 1640 now_ms, i.e. issues cused around now_ms > wraps. In fact it's -2147482008 ms, 1640 is if you round it to 31 bits, but anyway I get your point and we see the relation with the wrapping that we imagined but still couldn't catch. > Track down by dtrace show negative timeout.tv_sec/timeout.tv_nsec. Nice! All truss outputs we got so far were truncated to 3 args so we could never see the contents of the timeout field. Now the remaining questions are : - how is it possible to have such a large timeout in the timer queue - why doesn't the situation fix by itself on the next millisecond when it's not supposed anymore to be possible to have such a large timeout ; - why doesn't it affect other pollers - why some users experienced a kernel panic quickly after the issue But at least we're leaning towards something so your report is *very* valuable, thanks a lot. > I am don't know how to add loging to ev_kqueue.c for print > now_ms/exp/other stuff. Just after assignment of timeout.tv_sec/nsec, you could do something like this : if (timeout.tv_sec < 0 || timeout.tv_nsec < 0 || timeout.tv_nsec >= 1000000000) fprintf(stderr, "sec=%u nsec=%u\n", (unsigned)timeout.tv_sec, (unsigned)timeout.tv_nsec); There's also fddebug() to emit this output on fd -1 so that it can be debugged with strace/truss instead of polluting the stderr. > # grep time /usr/local/etc/haproxy.conf | sort | uniq > timeout client 100s > timeout client 50000 > timeout client 50000ms > timeout connect 100s > timeout connect 5000 > timeout connect 500ms > timeout queue 100s > timeout server 100s > timeout server 50000 > timeout server 2000ms > timeout server 30s Again nothing in nature of causing such a timeout to occur :-/ Thanks, Willy