On Tue, Apr 18, 2017 at 04:19:44PM +0300, Slawa Olhovchenkov wrote:
> Today I am got this issuse on all 3 haproxy different hosts
> simultaneously (about 07:38:57 UTC I am got 'kernel: sonewconn: pcb
> 0xfffff80040f781d0: Listen queue overflow: 1537 already in queue
> awaiting acceptance (564 occurrences)').

OK so you're one of those who got this as well! Cool! (not for you I
know). Till now we only got a few reports from other users who started
to meet the issue after upgrading from FBSD 10 to 11, and all of them
die at the same time. It was 09:38 in our last "prediction" and reported
as such by another person, so I think it was local time probably on UTC+2.

> 2017-04-18 07:38:57 UTC = 1492501137 unix time.
> 1492501137 unix time = 1640 now_ms, i.e. issues cused around now_ms
> wraps.

In fact it's -2147482008 ms, 1640 is if you round it to 31 bits, but
anyway I get your point and we see the relation with the wrapping that
we imagined but still couldn't catch.

> Track down by dtrace show negative timeout.tv_sec/timeout.tv_nsec.

Nice! All truss outputs we got so far were truncated to 3 args so we
could never see the contents of the timeout field.

Now the remaining questions are :
  - how is it possible to have such a large timeout in the timer queue
  - why doesn't the situation fix by itself on the next millisecond when
    it's not supposed anymore to be possible to have such a large timeout ;
  - why doesn't it affect other pollers
  - why some users experienced a kernel panic quickly after the issue

But at least we're leaning towards something so your report is *very*
valuable, thanks a lot.

> I am don't know how to add loging to ev_kqueue.c for print
> now_ms/exp/other stuff.

Just after assignment of timeout.tv_sec/nsec, you could do something
like this :

  if (timeout.tv_sec < 0 || timeout.tv_nsec < 0 || timeout.tv_nsec >= 
1000000000)
     fprintf(stderr, "sec=%u nsec=%u\n", (unsigned)timeout.tv_sec, 
(unsigned)timeout.tv_nsec);

There's also fddebug() to emit this output on fd -1 so that it can be
debugged with strace/truss instead of polluting the stderr.

> # grep time /usr/local/etc/haproxy.conf | sort | uniq
>         timeout client      100s
>         timeout client  50000
>         timeout client 50000ms
>         timeout connect      100s
>         timeout connect 5000
>         timeout connect 500ms
>         timeout queue   100s
>         timeout server      100s
>         timeout server  50000
>         timeout server 2000ms
>         timeout server 30s

Again nothing in nature of causing such a timeout to occur :-/

Thanks,
Willy


Reply via email to