Hi all,
On Wed, Apr 05, 2017 at 01:34:20AM +0200, Lukas Tribus wrote:
> Can we be absolutely positive that those hangs are not directly or
> indirectly caused by the bugs Willy already fixed in 1.7.4 and 1.7.5, for
> example from the ML thread "Problems with haproxy 1.7.3 on FreeBSD 11.0-p8"?
Unfortunately I don't believe this at all. The issues that were faced
on FreeBSD in earlier versions were related to connect() occasionally
succeeding synchronously, a case that haproxy did not handle cleanly
(it originally polled and then validated the connect() a second time,
and fixing this broke the rest).
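To illustrate what "succeeding synchronously" means, here is a minimal
sketch in Python rather than haproxy's C. The names start_connect() and
finish_connect() are made up for this example and are not haproxy
functions. A non-blocking connect() may either complete at once or report
that it is still in progress, and only the second case needs a poll
followed by a check of the socket error:

```python
import errno
import os
import select
import socket

def start_connect(addr):
    """Begin a non-blocking connect and report whether it already completed."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    rc = sock.connect_ex(addr)          # returns an errno value instead of raising
    if rc == 0:
        return sock, "established"      # synchronous success: no polling needed
    if rc in (errno.EINPROGRESS, errno.EWOULDBLOCK):
        return sock, "pending"          # must wait for writability first
    sock.close()
    raise OSError(rc, os.strerror(rc))

def finish_connect(sock, timeout=1.0):
    """Wait for a pending connect and validate it exactly once via SO_ERROR."""
    _, writable, _ = select.select([], [sock], [], timeout)
    if not writable:
        return False
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR) == 0
```

Which of the two states you get for a loopback connection depends on the
operating system, so a caller has to cope with both.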
> There may be multiple and different symptoms of those bugs, so even if the
> descriptions in those threads don't match your case 100%, it may still
> be caused by the same underlying bug.
>
> A confirmation that those hangs are still happening in v1.7.5 would be
> crucial.
I'm pretty sure they will still happen.
> The time coincidence is intriguing, but I would not spend too much time
> with that. Collecting actual traces (like strace or its freebsd equivalent)
> and capture dumps is more likely to achieve progress, imo.
In fact I do think there's an operating system issue here (and those who
know me also know that I'm not one who tries to hide haproxy bugs). What
I suspect is that there's a problem when time wraps. A 32-bit millisecond
counter (a 1 kHz scheduler tick) wraps every 49.7 days (2^32 ms). With
clocks synchronized over NTP, all of them wrap at exactly the same time.
If the issue is there, it may happen again on Tue Apr 18, 9:38 (13 days
from now).
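For those who want to check the arithmetic, here is a small Python sketch.
It assumes the counter is 32 bits wide and counts milliseconds, that the
reference point is the first-crash timestamp 0x58b43950 quoted further
down in this mail, and that the date above is expressed in UTC+2. The
last two are assumptions made for this example:

```python
from datetime import datetime, timedelta, timezone

WRAP_MS = 2 ** 32                                  # distinct values of a 32-bit ms counter
wrap_period = timedelta(milliseconds=WRAP_MS)
wrap_days = wrap_period / timedelta(days=1)

# Assumed reference point: the first-crash timestamp, read as Unix seconds.
first_crash = datetime.fromtimestamp(0x58b43950, tz=timezone.utc)
next_wrap = first_crash + wrap_period
cest = timezone(timedelta(hours=2))                # assumed local offset (UTC+2)

print(round(wrap_days, 1))                         # 49.7
print(first_crash.isoformat())                     # 2017-02-27T14:36:00+00:00
print(next_wrap.astimezone(cest).strftime("%a %b %d, %H:%M"))  # Tue Apr 18, 09:38
```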
It could have been haproxy's time wrapping and causing the issue, so I
modified it to add an offset and make the time wrap 5s after startup,
and couldn't trigger the problem on a FreeBSD system, even after
multiple attempts. And the time of the first crash reported above doesn't
match any wrapping pattern (0x58b43950). Also, reporters indicated
that the issue appeared after migrating to FreeBSD 11 and no such
issue was ever reported on earlier versions.
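The offset trick can be sketched as follows. This is Python for
illustration only, not the actual haproxy patch, and the helper names
wrap_offset(), now_ms32() and is_before() are invented for this example.
The idea is to pick an offset so that the 32-bit millisecond value rolls
over to zero 5 seconds after startup, then verify that wrap-safe
comparisons still order timestamps correctly across the rollover:

```python
MASK32 = 0xFFFFFFFF

def wrap_offset(start_ms, wrap_after_ms=5000):
    """Offset making (real_ms + offset) & MASK32 hit 0 wrap_after_ms after start."""
    return (-(start_ms + wrap_after_ms)) & MASK32

def now_ms32(real_ms, offset):
    """Current time as a 32-bit millisecond tick with the test offset applied."""
    return (real_ms + offset) & MASK32

def is_before(a, b):
    """Wrap-safe 'a is earlier than b', valid for ticks less than 2**31 ms apart."""
    return ((a - b) & MASK32) >= 0x80000000

start = 1488206160000                   # arbitrary startup time in milliseconds
off = wrap_offset(start)
just_before = now_ms32(start + 4999, off)
just_after = now_ms32(start + 5000, off)
print(hex(just_before), hex(just_after))            # 0xffffffff 0x0
print(is_before(just_before, just_after))           # True
```

With such an offset every code path that compares timers crosses the
wrap within seconds of starting, instead of once every 49.7 days.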
Also, Dave reported this, which is totally abnormal:
    kqueue(0,0,0....) = 22 (EINVAL)
as well as the fact that the system panicked, which cannot be an haproxy
issue.
Another point: Dave reported a loss of network connectivity at the
same moment when it last happened. Dave, could this be related to
other nodes running FreeBSD as well and rebooting, or anything like
that?
I think that at this point we should discuss this with some FreeBSD
maintainers and see what can be done to track this problem down, even
if it means adding some debugging code in the kqueue loop to help
troubleshoot this, or using it differently if we're doing something
wrong.
Given that Mark indicated that reloading the process fixed the problem
(except he had to manually kill the previous one), one possible workaround
might be to detect the EINVAL, and try to reinitialize kqueue or switch
to poll() if this happens (and emit loud warnings in the logs).
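The shape of such a workaround could look like the sketch below. It is
written in Python with hypothetical poller objects, so nothing here is
haproxy code: FallbackPoller, make_primary and make_fallback are names
invented for this example, and the factories are assumed to return an
object with a wait(timeout) method that already has all file descriptors
registered again:

```python
import errno
import logging

log = logging.getLogger("poller")

class FallbackPoller:
    """Retry a failing primary poller once on EINVAL, then switch to a fallback."""

    def __init__(self, make_primary, make_fallback):
        self.make_primary = make_primary      # hypothetical factory, e.g. kqueue-based
        self.make_fallback = make_fallback    # hypothetical factory, e.g. poll()-based
        self.poller = make_primary()
        self.reinit_done = False
        self.using_fallback = False

    def wait(self, timeout):
        while True:
            try:
                return self.poller.wait(timeout)
            except OSError as exc:
                # Only EINVAL from the primary poller is handled here.
                if exc.errno != errno.EINVAL or self.using_fallback:
                    raise
                if not self.reinit_done:
                    log.warning("poller returned EINVAL, reinitializing it")
                    self.poller = self.make_primary()
                    self.reinit_done = True
                else:
                    log.error("poller still returns EINVAL, switching to fallback")
                    self.poller = self.make_fallback()
                    self.using_fallback = True
```

Whether one reinitialization attempt is the right policy is an open
question. The sketch only shows where the detection and the loud warnings
would sit.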
> Hoping that this is not AI/IoT/Skynet trying to erase mankind, I wish y'all
> a good night,
There's still a faint possibility of a widespread attack, but while I
can easily imagine some such devices sending a "packet of death"
exploiting a bug in an OS, I don't believe it would make kqueue()
return EINVAL in haproxy.
Cheers,
Willy