On Tue, Apr 18, 2017 at 03:48:12PM +0200, Willy Tarreau wrote: > On Tue, Apr 18, 2017 at 04:19:44PM +0300, Slawa Olhovchenkov wrote: > > Today I am got this issuse on all 3 haproxy different hosts > > simultaneously (about 07:38:57 UTC I am got 'kernel: sonewconn: pcb > > 0xfffff80040f781d0: Listen queue overflow: 1537 already in queue > > awaiting acceptance (564 occurrences)'). > > OK so you're one of those who got this as well! Cool! (not for you I > know). Till now we only got a few reports from other users who started > to meet the issue after upgrading from FBSD 10 to 11, and all of them > die at the same time. It was 09:38 in our last "prediction" and reported > as such by another person, so I think it was local time probably on UTC+2. > > > 2017-04-18 07:38:57 UTC = 1492501137 unix time. > > 1492501137 unix time = 1640 now_ms, i.e. issues cused around now_ms > > wraps. > > In fact it's -2147482008 ms, 1640 is if you round it to 31 bits, but > anyway I get your point and we see the relation with the wrapping that > we imagined but still couldn't catch. > > > Track down by dtrace show negative timeout.tv_sec/timeout.tv_nsec. > > Nice! All truss outputs we got so far were truncated to 3 args so we > could never see the contents of the timeout field. > > Now the remaining questions are : > - how is it possible to have such a large timeout in the timer queue > - why doesn't the situation fix by itself on the next millisecond when > it's not supposed anymore to be possible to have such a large timeout ;
May be tv_update_date(delta_ms, status) for negative status do nothing. > - why doesn't it affect other pollers What you mean? > - why some users experienced a kernel panic quickly after the issue > > But at least we're leaning towards something so your report is *very* > valuable, thanks a lot. > > > I am don't know how to add loging to ev_kqueue.c for print > > now_ms/exp/other stuff. > > Just after assignment of timeout.tv_sec/nsec, you could do something > like this : > > if (timeout.tv_sec < 0 || timeout.tv_nsec < 0 || timeout.tv_nsec >= > 1000000000) > fprintf(stderr, "sec=%u nsec=%u\n", (unsigned)timeout.tv_sec, > (unsigned)timeout.tv_nsec); > > There's also fddebug() to emit this output on fd -1 so that it can be > debugged with strace/truss instead of polluting the stderr. Not nice logging for production systems. I am expect next trouble around 2-3 month, this is to long and complex for strace/truss collecting. And for 2K RPS on production system. Also, logging only timeout.tv_sec/timeout.tv_nsec don't help you: I am already see this negative values (timeout.tv_sec is about -10700 at 10:38 UTC, timeout.tv_nsec very large negative and changed, don't record, sorry). May be logging some more data can be help? now_ms and exp, for example?

