I'm going to stay with version 1.7.2 till then, so we'll have a comparison.
If we think we may have a hang at Tue Apr 18, 9:38, is there any specific logging we should set up on a server at that time? Is it worth configuring at least one server with nokqueue set for that window?

Thanks

David

On 5 April 2017 at 07:00, Willy Tarreau <[email protected]> wrote:
> Hi all,
>
> On Wed, Apr 05, 2017 at 01:34:20AM +0200, Lukas Tribus wrote:
> > Can we be absolutely positive that those hangs are not directly or
> > indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5,
> > for example from the ML thread "Problems with haproxy 1.7.3 on
> > FreeBSD 11.0-p8"?
>
> I don't believe in this at all, unfortunately. The issues that were faced
> on FreeBSD in earlier versions were related to connect() occasionally
> succeeding synchronously, a case haproxy did not handle cleanly
> (it initially used to poll and then validate the connect() a second time,
> and fixing this broke the rest).
>
> > There may be multiple and different symptoms of those bugs, so even if
> > the descriptions in those threads don't match your case 100%, it may
> > still be caused by the same underlying bug.
> >
> > A confirmation that those hangs are still happening in v1.7.5 would be
> > crucial.
>
> I'm pretty sure they will still happen.
>
> > The time coincidence is intriguing, but I would not spend too much time
> > on it. Collecting actual traces (like strace or its FreeBSD equivalent)
> > and capture dumps is more likely to achieve progress, imo.
>
> In fact I do think there's an operating system issue here (and those who
> know me also know that I'm not one who tries to hide haproxy bugs). What
> I suspect is that there's a problem when time wraps. A 1 kHz scheduler
> wraps every 49.7 days. With clocks synchronized over NTP, all of them
> wrap at exactly the same time. If the issue is there, it may happen
> again on Tue Apr 18, 9:38 (13 days from now).
> It could have been haproxy's own time wrapping and causing the issue, so
> I modified it to add an offset and make the time wrap 5s after startup,
> and couldn't trigger the problem on a FreeBSD system, even after
> multiple attempts. And the time of the first crash reported above doesn't
> match any wrapping pattern (0x58b43950). Also, reporters indicated
> that the issue appeared after migrating to FreeBSD 11, and no such
> issue was ever reported on earlier versions.
>
> Also, Dave reported this, which is totally abnormal:
>
>   kqueue(0,0,0....) = 22 (EINVAL)
>
> and the fact that the system panicked, which cannot be an haproxy issue.
>
> Another point: Dave reported a loss of network connectivity at the
> same moment when it last happened. Dave, could this be related to
> other nodes also running FreeBSD and rebooting, or any such thing?
>
> I think that at this point we should discuss with some FreeBSD
> maintainers and see what can be done to track this problem down, even
> if it means adding some debugging code in the kqueue loop to help
> troubleshoot it, or using kqueue differently if we're doing something
> wrong.
>
> Given that Mark indicated that reloading the process fixed the problem
> (except that he had to manually kill the previous one), one possible
> workaround might be to detect the EINVAL and try to reinitialize kqueue
> or switch to poll() if this happens (and emit loud warnings in the logs).
>
> > Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish
> > y'all a good night,
>
> There's still a faint possibility of a widespread attack, but while I
> can easily imagine such devices sending a "packet of death" exploiting
> a bug in an OS, I don't believe it would make kqueue() return EINVAL
> in haproxy.
>
> Cheers,
> Willy
>

