I'm going to stay with version 1.7.2 until then, so we should have a
comparison.

If we think we may have a hang at Tue Apr 18, 9:38, is there any specific
logging we should set up on a server at that time? Is it worth setting at
least one server to have nokqueue set for that window?

Thanks

David

On 5 April 2017 at 07:00, Willy Tarreau <[email protected]> wrote:

> Hi all,
>
> On Wed, Apr 05, 2017 at 01:34:20AM +0200, Lukas Tribus wrote:
> > Can we be absolutely positive that those hangs are not directly or
> > indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5, for
> > example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD
> 11.0-p8"?
>
> I don't believe in this at all unfortunately. The issues that were faced
> on FreeBSD in earlier versions were related to connect() occasionally
> succeeding synchronously and haproxy did not handle this case cleanly
> (it initially used to poll then validate the connect() a second time,
> and fixing this broke the rest).
>
> > There may be multiple and different symptoms of those bugs, so even if the
> > descriptions in those threads don't match your case 100%, it may still be
> > caused by the same underlying bug.
> >
> > A confirmation that those hangs are still happening in v1.7.5 would be
> > crucial.
>
> I'm pretty sure they will still happen.
>
> > The time co-incidence is intriguing, but I would not spend too much time
> > with that. Collecting actual traces (like strace or its freebsd
> equivalent)
> > and capture dumps is more likely to achieve progress, imo.
>
> In fact I do think there's an operating system issue here (and those who
> know me also know that I'm not one who tries to hide haproxy bugs). What
> I suspect is that there's a problem when time wraps: a 32-bit tick
> counter driven by a 1 kHz scheduler wraps every 49.7 days. With clocks
> synchronized over NTP, all of them wrap at exactly the same time. If
> the issue is there, it may happen again on Tue Apr 18, 9:38 (13 days
> from now).
>
> It could have been haproxy's time wrapping and causing the issue, so I
> modified it to add an offset and make the time wrap 5s after startup,
> and couldn't trigger the problem on a FreeBSD system, even after
> multiple attempts. And the time of first crash reported above doesn't
> match any wrapping pattern (0x58b43950). Also, reporters indicated
> that the issue appeared after migrating to FreeBSD 11 and no such
> issue was ever reported on earlier versions.
>
> Also, Dave reported this, which is totally abnormal:
>
>       kqueue(0,0,0....) = 22 (EINVAL)
>
> and the fact that the system panicked, which cannot be an haproxy issue.
>
> Another point, Dave reported a loss of network connectivity at the
> same moment when it last happened. Dave, could this be related to
> other nodes also running FreeBSD and rebooting, or any such thing?
>
> I think that at this point we should discuss with some FreeBSD
> maintainers and see what can be done to track this problem down, even
> if it means adding some debugging code in the kqueue loop to help
> troubleshoot this, or using it differently if we're doing something
> wrong.
>
> Given that Mark indicated that reloading the process fixed the problem
> (except he had to manually kill the previous one), one possible workaround
> might be to detect the EINVAL, and try to reinitialize kqueue or switch
> to poll() if this happens (and emit loud warnings in the logs).
>
> > Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish
> y'all
> > a good night,
>
> There's still a faint possibility of a widespread attack but while I
> can easily imagine some such devices sending a "packet of death"
> exploiting a bug in an OS, I don't believe it would make kqueue()
> return EINVAL in haproxy.
>
> Cheers,
> Willy
>
